Methods and compositions comprising Renilla GFP

ABSTRACT

The invention relates to methods and compositions utilizing Renilla green fluorescent proteins (rGFP) and Ptilosarcus green fluorescent proteins (pGFP). In particular, the invention relates to the use of Renilla GFP or Ptilosarcus GFP proteins as reporters for cell assays, particularly intracellular assays, including methods of screening libraries using rGFP or pGFP.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e)/120 of U.S. Provisional Application No. 60/290,287, filed May 10, 2001, and of U.S. application Ser. No. 09/710,058, filed Nov. 10, 2000, which claims the benefit of U.S. Provisional Application No. 60/164,592, filed Nov. 10, 1999.

FIELD OF THE INVENTION

The invention relates to methods and compositions using Renilla green fluorescent proteins (rGFP) and Ptilosarcus green fluorescent proteins (pGFP). In particular, the invention relates to the use of rGFP or pGFP proteins as reporters for cell assays, particularly intracellular assays, including methods of screening libraries using rGFP and pGFP.

BACKGROUND OF THE INVENTION

The field of biomolecule screening for biologically and therapeutically relevant compounds is rapidly growing. Relevant biomolecules that have been the focus of such screenings include chemical libraries, nucleic acid libraries, and peptide libraries in search for molecules that either inhibit or augment the biological activity of identified target molecules. With particular regard to peptide libraries, the isolation of peptide inhibitors of targets and the identification of formal binding partners of targets has been a key focus. However, one particular problem with peptide libraries is the difficulty of assessing whether any particular peptide has been expressed, and at what level, prior to determining whether the peptide has a biological effect.

The green fluorescent protein from Aequorea victoria (hereinafter “aGFP”) is a 238 amino acid protein displaying autofluorescent properties. The crystal structure of the protein and several point mutants has been solved (Ormo, M. et al. (1996) Science 273: 1392–95; Yang, F. et al. (1996) Nature Biotechnol. 14: 1246–51). The fluorophore, consisting of a modified tripeptide, is buried inside a relatively rigid β-can structure, where it is almost completely protected from solvent access. The protein fluorescence is sensitive to a number of point mutations (Phillips, G. N. (1997) Curr. Opin. Struct. Biol. 7: 821–27). Since any disruption of the structure allowing solvent access to the fluorophoric tripeptide results in fluorescence quenching, the fluorescence appears to be a sensitive indication of the preservation of the native structure of the protein.

Uses of GFP as a biological marker, for example for gene expression, protein targeting, protein interactions, and biosensors, are well known. The extensively examined aGFP folds efficiently at or below room temperature, but fails to fold properly at higher temperatures. Aggregation of the protein appears to occur when it is overexpressed in certain organisms, resulting in weak fluorescence. In addition, the fluorescence of the native aGFP has a low quantum yield, which has prompted a search for variants of aGFP with improved stability and fluorescence properties. Although expression of aGFP is generally non-toxic to the cell in which it is expressed, there is some suggestion that aGFP is cytotoxic and may induce apoptosis in expressing cells (Liu, H. S. et al. (1999) Biochem. Biophys. Res. Commun. 260: 712–17). Finally, aGFP has been used as a scaffold for peptide display. However, some peptide insertions at the surface loops of aGFP result in low fluorescence, which suggests that aGFP may be sensitive to structural perturbations.

In view of the physical and biological properties of aGFP, other forms of GFPs are desirable with fluorescence and stability characteristics different from aGFP. Green fluorescent proteins have been cloned from Renilla reniformis (hereinafter “rrGFP”), Renilla muelleri (hereinafter “rmGFP”), and Ptilosarcus gurneyi (hereinafter “pGFP”) (see WO 99/49019, hereby expressly incorporated by reference). The core chromophore sequence of the rGFP and pGFPs is different from aGFP, and the Renilla forms have fluorescence characteristics with higher molar absorbance coefficient and narrower absorption/emission spectra as compared to aGFP (Ward, W. W. et al. (1979) J. Biol. Chem. 254: 781–88). The lack of significant homology to aGFP suggests that Renilla and Ptilosarcus forms provide important alternatives to the extensively exploited aGFP. Accordingly, it is the object of the present invention to provide compositions and methods comprising rGFP and pGFP.

SUMMARY OF THE INVENTION

In accordance with the objects outlined above, the present invention provides retroviral vectors comprising a promoter and a rGFP and/or a pGFP nucleic acid. Additional nucleic acid vectors embodied by this invention comprise a first gene of interest, a separation site, and a second gene of interest, wherein the first or second gene of interest is a rGFP or pGFP gene. The separation site may be an IRES element, a Type 2A sequence, or a protease recognition sequence. The gene of interest may comprise reporter genes, selection genes, cDNAs, genomic DNAs, or random peptides.

In a preferred embodiment, the rmGFP or pGFP used in the vectors are codon optimized for expression. That is, the rmGFP or pGFP are variants containing the preferred codons used in the cells or organism in which the rmGFP or pGFP are to be expressed. In a preferred embodiment, the rmGFP or pGFP is codon optimized for expression in mammalian cells, most preferably in human cells.

In another preferred embodiment, the present invention provides for fusions of a gene of interest and a gene encoding rmGFP or pGFP. The gene of interest may comprise cDNA, genomic DNA, or a nucleic acid encoding a random peptide. In a preferred embodiment, the codons are optimized for expression as described above.

In a further preferred embodiment, the fusion nucleic acids comprise a library of fusion nucleic acids. That is, in one aspect, each member of the library may comprise a promoter, a first gene of interest, a separation sequence, and a second gene of interest, wherein the first or second gene of interest comprises a rGFP or pGFP. In another aspect, the library may comprise fusions of a gene of interest and a gene encoding codon optimized rmGFP or pGFP. The present invention also provides for cells and libraries of cells comprising either of these types of fusion nucleic acids.

In a preferred embodiment, the present invention also provides for methods of screening for bioactive agents capable of altering a cell phenotype. The methods comprise contacting a cell or a plurality of cells comprising a fusion nucleic acid comprising a promoter and a codon optimized rmGFP or pGFP with at least one candidate agent, and screening the cells for an altered phenotype. Alternatively, the cells comprise a fusion nucleic acid comprising a promoter, rGFP or pGFP, a separation sequence, and a gene of interest.

In a preferred embodiment, the present invention provides a method of screening for bioactive agents capable of inhibiting or activating a promoter. The method of screening comprises first combining a candidate bioactive agent and a cell comprising a fusion nucleic acid comprising a promoter of interest and a nucleic acid encoding either rGFP or pGFP, then optionally inducing the promoter and detecting the presence of said rGFP or pGFP protein. In another aspect, the promoter is operably linked to a fusion nucleic acid comprising a rGFP or pGFP, a separation sequence, and a gene of interest. The gene of interest may comprise a reporter gene, a selection gene, or a nucleic acid encoding a dominant effect protein.

In a further preferred embodiment, the method comprises screening for agents inhibiting or activating an IL-4 inducible ε promoter. The method comprises first combining a candidate bioactive agent with a cell comprising a fusion nucleic acid comprising an IL-4 inducible ε promoter operably linked to the fusion nucleic acids described above; inducing said promoter with IL-4; and then detecting the presence of said rGFP or pGFP protein. The absence of said rGFP or pGFP protein indicates that the agent inhibits the IL-4 inducible ε promoter.

The methods of screening for candidate agents altering a cell phenotype further comprise isolating the cell with the altered phenotype and identifying the candidate agent responsible for producing the altered phenotype.

DETAILED DESCRIPTION OF THE FIGURES

FIG. 1 (SEQ ID NOS: 3–9) shows an alignment of amino acid sequences of anthozoan GFPs with the Aequorea sequence using the ClustalW program. The Renilla muelleri (RENM) and Ptilosarcus gurneyi (PTIL) sequences (SEQ ID NOS: 7–8) are shown above the Aequorea GFP (AEQV) sequence (SEQ ID NO: 9) at the bottom. The italicized residues are the fluorescent tripeptide (chromophore). The sequences of the four Anthozoan GFPs that emit light between 483–506 nm are from Matz, M. et al. (1999) Nature Biotech. 17: 969–973: ANEM, Anemonia majano GFP (SEQ ID NO: 4); DSFP, Discosoma striata GFP; FP48 (SEQ ID NO: 5), Clavularia GFP (SEQ ID NO: 6); and ZFP5, Zoanthus GFP (SEQ ID NO: 3). The first 35 residues are removed from the amino terminus of FP48. A consensus residue (CONS) was listed if at least 4 of the 7 residues were identical. Residues comprising turns and loops between the β-strands in the Aequorea GFP, based on visual analysis of the Aequorea GFP crystal structure (Yang, et al. (1996) Nature Biotechnol. 14: 1246–51), are underlined. The two residues on either side of the site of the inserted 22 mer peptide in the Renilla muelleri sequence are listed in bold type and designated as loops A–F in bold. The corresponding replacement sites in Aequorea GFP that allow formation of a fluorescent protein (Peelle, B. et al. (2001) Chem. Biol. 8: 521–34) are also shown in bold.

FIG. 2 compares the nucleic acid sequence of wild type (SEQ ID NO: 10; wt; lower sequence) Renilla muelleri GFP and the variant sequence codon optimized (SEQ ID NO: 1; co; upper sequence) for expression in human cells. In the codon optimized variant, 9 of the 239 amino acids are not optimized for preferred human codons in order to introduce restriction sites into the coding sequences. The codon optimized sequence has a glycine inserted following the initiating methionine residue to provide further stability to the expressed rmGFP.

FIG. 3 compares the nucleic acid sequence of wild type (SEQ ID NO: 11; wt; lower sequence) Ptilosarcus gurneyi GFP and a variant sequence codon optimized (SEQ ID NO: 2; co; upper sequence) for expression in human cells. Similar to the codon optimized Renilla muelleri variant, the optimized Ptilosarcus GFP has 11 of the 239 amino acids not optimized for preferred human codons in order to introduce restriction sites into the coding sequences. As above, a glycine residue is inserted after the initiating methionine residue to provide stability to the expressed pGFP.

FIG. 4 shows the circular dichroism (CD) spectra of Aequorea victoria, Renilla muelleri, and Ptilosarcus gurneyi GFPs. CD spectra were taken at pH 7.5 in 10 mM potassium phosphate buffer with 0.1 M potassium fluoride and measured from 200–250 nm: EGFP (open circles), Renilla (open grey squares) and Ptilosarcus (filled squares) GFPs. Deconvolution of these spectra indicates the secondary structure content of all three GFPs to be identical.

FIG. 5 shows the thermal denaturation curves for Aequorea victoria, Renilla muelleri, and Ptilosarcus gurneyi GFPs as measured by CD. The most stable protein was Renilla GFP (open circles) with a T_(m) of 86.1° C., followed by EGFP (filled squares) with a T_(m) of 83.7° C. and Ptilosarcus GFP (open triangles) with a T_(m) of 80.5° C.

FIG. 6 gives the results of retroviral expression in human cells of human codon optimized Renilla muelleri, Ptilosarcus gurneyi, and Aequorea victoria GFPs. The retroviral constructs were introduced into Jurkat E cells and examined by flow cytometry. FACS plots of wild type (WT) and codon optimized Renilla muelleri GFP (R), Aequorea victoria GFP (E), and flag tagged versions of Ptilosarcus GFP (Pf), Renilla GFP (Rf) and Aequorea GFP (Ef) were obtained 4 days after infection. Both Ptilosarcus and Renilla GFPs have higher fluorescence intensities than Aequorea GFP. Uninfected cells are shown off scale due to a shift of the dynamic range, ca. 2.6 log units to the left, by FL1 compensation on the cytometer. Geometric mean fluorescence values are listed in the upper right corner for each population within the gated region underlined.

FIG. 7 gives FACS analysis of Jurkat E cell expression of Renilla GFP with a 22mer HA epitope tag inserted into positions A to F. Plots in column A are shown with a standard fluorescence scale. For plots in column B, FL-1 channel compensation was used to shift the fluorescence detection range, ca. 2.6 log units, to the left to observe the high level of fluorescence. Renilla GFP is shown without insert (R), and with inserts in positions A, B, C, D, E, and F as labeled. Aequorea GFP is shown without an insert (EGFP) and with the same insert in its equivalent position D (EGFP3). The sites of insertion, A–F, are shown underlined in FIG. 1. The constructs were retrovirally expressed in Jurkat E cells and analyzed by FACS 4 days post-infection. The GFP geometric mean fluorescence values from the gated regions are listed in the upper right of each plot. D, F, and EGFP3 retain 30–49% of their respective parent GFP fluorescence levels. B, C, and E had observable but much lower levels of fluorescence than the parent Renilla GFP. The position A insert has almost no measurable fluorescence above background.

FIG. 8 shows fluorescence micrographs of cells expressing fusion proteins comprising peptides inserted into sites D and F of Renilla muelleri GFP. The fusion proteins were retrovirally expressed in A549 cells. Expression of a fusion protein comprising a hemagglutinin epitope (HA) inserted into sites D and F is shown in panels 1 and 2, respectively. Fluorescence occurs throughout the cell. Expression of a fusion protein comprising an NLS peptide, derived from SV40, inserted into sites D and F (panels 3 and 4, respectively) results in fluorescence only in the nucleus. The results show that the displayed NLS peptide is functional when presented as a peptide inserted onto a GFP scaffold and that the GFP molecule retains its fluorescence.

FIG. 9 (SEQ ID NOS: 12–23) shows various Type 2A separation sequences useful in the present invention. These Type 2A sequences are found in aphtho- and cardioviral genomes. The general sequence is XXXXXXXXXXLXXDXEXNPGP (SEQ ID NO: 23), where X is any amino acid. Invariant amino acids are shown in bold. Failure of peptide bond formation occurs at the junction between the carboxy terminal glycine and proline (underlined). The 2A sequence also shows a number of residues with conserved amino acid substitutions.
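By way of illustration only, the general Type 2A pattern stated above can be written as a regular expression and used to locate the glycine/proline junction in a polyprotein sequence. The following Python sketch assumes one-letter amino acid codes; the function name `split_at_2a` and the example sequence are hypothetical and are not taken from FIG. 9.

```python
import re

# General Type 2A pattern from the text: XXXXXXXXXXLXXDXEXNPGP,
# where X is any amino acid (one-letter codes assumed).
TYPE_2A = re.compile(r"[A-Z]{10}L[A-Z]{2}D[A-Z]E[A-Z]NPGP")


def split_at_2a(protein):
    """Split a sequence between the C-terminal Gly and Pro of the first
    Type 2A match. Returns (upstream, downstream), or None if no match."""
    match = TYPE_2A.search(protein)
    if match is None:
        return None
    junction = match.end() - 1  # index of the final Pro of the motif
    return protein[:junction], protein[junction:]


# Hypothetical sequence constructed only to exercise the pattern.
example = "M" + "A" * 10 + "LAGDVESNPGP" + "MSK"
print(split_at_2a(example))
```

In this sketch the upstream product ends with the glycine of the motif and the downstream product begins with the proline, mirroring the junction described above.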

DETAILED DESCRIPTION OF THE INVENTION

The present invention is directed to the use of Renilla green fluorescent protein (hereinafter “rGFP”) in a variety of methods and compositions that exploit the autofluorescent properties of rGFP. These methods include, but are not limited to, the use of rGFP as a reporter molecule in cell screening assays, including intracellular assays; the use of rGFP as a scaffold protein for fusions with random peptide libraries; etc. Similarly, compositions of rGFP are provided, including constructs of rGFP such as fusion constructs that include rGFP as a reporter gene, retroviral constructs including rGFP and separation sequences, etc. Basically, the invention provides a number of novel uses for rGFP, similar to those outlined for aGFP in WO 95/07463, hereby incorporated by reference in its entirety. In addition, the invention is also directed to the use of Ptilosarcus green fluorescent protein, the amino acid sequence of which is shown in FIG. 1 (SEQ ID NO: 7) and is also depicted in WO 99/49019. It should be noted that while the discussion below is generally directed to rGFP, pGFP may be used as well.

In a preferred embodiment, the invention provides compositions including rGFP. By “Renilla green fluorescent protein” or “rGFP” herein is meant a protein that has significant homology, as defined herein, to the wild-type Renilla reniformis or Renilla muelleri protein of FIG. 1 (SEQ ID NO: 8), both of which are described in WO 99/49019, hereby incorporated by reference in its entirety.

In a preferred embodiment, the invention provides compositions including pGFP. By “Ptilosarcus green fluorescent protein” or “pGFP” herein is meant a protein that has significant homology, as defined herein, to the wild-type Ptilosarcus protein of FIG. 1 (SEQ ID NO: 7), as described in WO 99/49019, hereby incorporated by reference in its entirety.

A rGFP or pGFP protein of the present invention may be identified in several ways. “Protein” in this sense includes proteins, polypeptides, and peptides. A nucleic acid or rGFP protein is initially identified by substantial nucleic acid and/or amino acid sequence homology to the sequences shown in FIGS. 1 and 2 (SEQ ID NOS: 1, 3–10). Such homology can be based upon the overall nucleic acid or amino acid sequence. Similarly, a nucleic acid or pGFP protein is also initially identified by substantial nucleic acid and/or amino acid sequence homology to the sequences shown in FIGS. 1 and 3 (SEQ ID NOS: 2, 3–9, 11). And again, such homology can be based upon overall nucleic acid or amino acid sequence.

As used herein, a protein is a “rGFP protein” or “pGFP protein” if the overall homology of the protein sequence to the respective amino acid sequences shown in FIG. 1 is preferably greater than about 75%, more preferably greater than about 80%, even more preferably greater than about 85%, and most preferably greater than 90%. In some embodiments the homology will be as high as about 93 to 95 or 98%.

Homology in this context means sequence similarity or identity, with identity being preferred. This homology will be determined using standard techniques known in the art, including, but not limited to, the local homology algorithm of Smith and Waterman (1981) Adv. Appl. Math. 2:482; by the homology alignment algorithm of Needleman and Wunsch, (1970) J. Mol. Biol. 48:443; by the search for similarity method of Pearson and Lipman (1988) Proc. Natl. Acad. Sci. USA 85:2444; by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Drive, Madison, Wis.); or the Best Fit sequence program described by Devereux, J. et al. (1984) Nucleic Acids Res. 12: 387–95, preferably using the default settings, or by inspection.

In a preferred embodiment, similarity is calculated by FastDB based upon the following parameters: mismatch penalty of 1.0; gap size penalty of 0.33; and joining penalty of 30.0 (“Current methods in Comparison and Analysis”, Macromolecule Sequencing and Synthesis, selected methods and Applications, pp. 127–149, Alan R. Liss, Inc., 1998). Another example of a useful algorithm is PILEUP. PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. PILEUP uses a simplification of the progressive alignment method of Feng and Doolittle (1987) J. Mol. Evol. 35: 351–60; the method is similar to that described by Higgins and Sharp (1989) CABIOS 5: 151–3. Useful PILEUP parameters include a default gap weight of 3.00, a default gap length weight of 0.10, and weighted end gaps.

An additional example of a useful algorithm is the BLAST algorithm, described in Altschul, S. F. et al. (1990) J. Mol. Biol. 215: 403–10 and Karlin, et al. (1993) Proc. Natl. Acad. Sci. USA 90: 5873–87. A particularly useful BLAST program is the WU-BLAST-2 program, which was obtained from Altschul et al. (1996) Methods Enzymol. 266:460–80; http://blast.wustl/edu/blast/README.html. WU-BLAST-2 uses several search parameters, most of which are set to the default values. The adjustable parameters are set with the following values: overlap span=1, overlap fraction=0.125, and word threshold (T)=11. The HSP S and HSP S2 parameters are dynamic values and are established by the program itself depending upon the composition of the particular sequence and composition of the particular database against which the sequence of interest is being searched; however, the values may be adjusted to increase sensitivity. A % amino acid sequence identity value is determined by the number of matching identical residues divided by the total number of residues of the “longer” sequence in the aligned region. The “longer” sequence is the one having the most actual residues in the aligned region (gaps introduced by WU-BLAST-2 to maximize the alignment score are ignored).
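The identity calculation described in the preceding paragraph can be sketched as follows. This is a minimal Python illustration and not the WU-BLAST-2 implementation: it assumes the two sequences have already been aligned, that gaps are written as "-", and the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over an aligned region: identical residue pairs
    divided by the residue count of the "longer" sequence, where gap
    characters are not counted as residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both sequences carry the same residue.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b)
                  if a == b and a != gap)
    # Actual residues in each sequence, ignoring introduced gaps.
    residues_a = len(aligned_a) - aligned_a.count(gap)
    residues_b = len(aligned_b) - aligned_b.count(gap)
    longer = max(residues_a, residues_b)
    if longer == 0:
        raise ValueError("aligned region contains no residues")
    return 100.0 * matches / longer


# Hypothetical aligned fragments, used only to exercise the function.
print(percent_identity("MKT-LV", "MKSALV"))
```

In the example, four columns are identical and the longer sequence has six residues in the aligned region, so the value is 4/6 expressed as a percentage.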

In a similar manner, “percent (%) nucleic acid sequence identity” with respect to the coding sequence of the polypeptides identified herein is defined as the percentage of nucleotide residues in a candidate sequence that are identical with the nucleotide residues in the coding sequence of the rGFP or pGFP proteins (FIG. 1). A preferred method utilizes the BLASTN module of WU-BLAST-2 set to the default parameters, with overlap span and overlap fraction set to 1 and 0.125, respectively.

An additional useful algorithm is gapped BLAST as reported by Altschul, S. F. et al. (1997) Nucleic Acids Res. 25:3389–402. Gapped BLAST uses BLOSUM-62 substitution scores; threshold T parameter set to 9; the two-hit method to trigger ungapped extensions; charges gap lengths of k a cost of 10+k; X_(u) set to 16; and X_(g) set to 40 for the database search stage and to 67 for the output stage of the algorithms. Gapped alignments are triggered by a score corresponding to ˜22 bits.

The alignment may include the introduction of gaps in the sequences to be aligned (see FIG. 1). In addition, for sequences which contain either more or fewer amino acids than the protein sequences shown in FIG. 1, it is understood that the percentage of homology will be determined based on the number of homologous amino acids in relation to the total number of amino acids. Thus, for example, homology of sequences shorter than that shown in FIG. 1, as discussed below, will be determined using the number of amino acids in the shorter sequence.

The rGFP and pGFP proteins of the present invention may be shorter or longer than the amino acid sequences shown in FIG. 1. Thus, in a preferred embodiment, included within the definition of rGFP and pGFP proteins are portions or fragments of the sequences depicted herein. Portions or fragments of rGFP or pGFP proteins are considered rGFP or pGFP proteins if a) they share at least one antigenic epitope; b) have at least the indicated homology; c) preferably have rGFP or pGFP biological activity, e.g., including, but not limited to, autofluorescence; or d) fold into a stable structure that is similar to the wild-type rGFP or pGFP structure.

For example, rGFP or pGFP deletion mutants can be made. At the N-terminus, it is known that only the first amino acid of the aGFP protein may be deleted without loss of fluorescence. At the C-terminus of the aGFP, up to 7 residues can be deleted without loss of fluorescence (see Phillips, G. N. et al. (1997) Curr. Opin. Struct. Biol. 7: 821–27). This presumably applies to rGFP and pGFP as well.

In one embodiment, the rGFP or pGFP proteins are derivative or variant rGFP or pGFP proteins. That is, as outlined more fully below, the derivative rGFP or pGFP will contain at least one amino acid substitution, deletion or insertion, with amino acid substitutions being particularly preferred. The amino acid substitution, insertion or deletion may occur at any residue within the rGFP or pGFP protein. These variants ordinarily are prepared by site specific mutagenesis of nucleotides in the DNA encoding the GFP proteins, using cassette or PCR mutagenesis, DNA shuffling mutagenesis, or other techniques well known in the art, to produce DNA encoding the variant, and thereafter expressing the DNA in recombinant cells as is known in the art and outlined herein. However, variant rGFP or pGFP protein fragments having up to about 100–150 residues may be prepared by in vitro synthesis using established techniques. Amino acid sequence variants are characterized by the predetermined nature of the variation, a feature that sets them apart from naturally occurring allelic or interspecies variation of the rGFP or pGFP protein amino acid sequence. The variants typically exhibit the same qualitative biological activity as the naturally occurring analogue, although variants can also be selected which have modified characteristics, as will be more fully outlined below. That is, in a preferred embodiment, when non-wild-type rGFP or pGFP is used, the derivative preferably has at least 1% of wild-type fluorescence, with at least about 10% being preferred, at least about 50–60% being particularly preferred, and 95% to 98% to 100% being especially preferred.
In general, what is important is that there is enough fluorescence to allow sorting and/or detection above background, for example when using a fluorescence-activated cell sorter (FACS) machine. However, in some embodiments, for example when fusion proteins with rGFP or pGFP are made, it is possible to detect the fusion proteins non-fluorescently using, for example, antibodies directed to either an epitope tag (i.e., purification sequence) or to the rGFP or pGFP itself. In this case, the rGFP or pGFP scaffold does not have to be fluorescent, if it can be shown that the rGFP or pGFP is folding correctly and/or reproducibly.

Thus, the rGFP or pGFP may be wild type or variants thereof. These variants fall into one or more of three classes: substitutional, insertional or deletional variants. These variants ordinarily are prepared by site specific mutagenesis of nucleotides in the DNA encoding the GFP, using cassette or PCR mutagenesis or other techniques well known in the art, to produce DNA encoding the variant, and thereafter expressing the DNA in recombinant cell culture as outlined herein. However, variant protein fragments having up to about 100–150 residues may be prepared by in vitro synthesis using established techniques. Amino acid sequence variants are characterized by the predetermined nature of the variation, a feature that sets them apart from naturally occurring allelic or interspecies variation of the rGFP or pGFP amino acid sequences. The variants typically exhibit the same qualitative biological activity as the naturally occurring analogue, although variants can also be selected which have modified characteristics as will be more fully outlined below.

While the site or region for introducing an amino acid sequence variation is predetermined, the mutation per se need not be predetermined. For example, in order to optimize the performance of a mutation at a given site, random mutagenesis may be conducted at the target codon or region and the expressed scaffold variants screened for the optimal combination of desired activity. Techniques for making substitution mutations at predetermined sites in DNA having a known sequence are well known, for example, M13 primer mutagenesis and PCR mutagenesis. Screening of the mutants is done using assays of scaffold protein activities.

Amino acid substitutions are typically of single residues; insertions usually will be on the order of from about 1 to 20 amino acids, although considerably larger insertions may be tolerated. Deletions range from about 1 to about 20 residues, although in some cases deletions may be much larger.

Substitutions, deletions, insertions or any combination thereof may be used to arrive at a final derivative. Generally these changes are done on a few amino acids to minimize the alteration of the molecule. However, larger changes may be tolerated in certain circumstances. When small alterations in the characteristics of the rGFP or pGFP protein are desired, substitutions are generally made in accordance with the following table:

TABLE I

Original Residue    Exemplary Substitutions
Ala                 Ser
Arg                 Lys
Asn                 Gln, (His)
Asp                 Glu
Cys                 Ser
Gln                 Asn
Glu                 Asp
Gly                 Pro
His                 Tyr, (Asn), (Gln)
Ile                 Leu, Val
Leu                 Ile, Val
Lys                 Arg, (Gln), (Glu)
Met                 Leu, Ile
Phe                 Tyr, Trp, (Met), (Leu)
Ser                 Thr
Thr                 Ser
Trp                 Tyr, Phe
Tyr                 Trp, Phe
Val                 Ile, Leu

Less favored substitutions are given in parentheses. Substantial changes in function or immunological identity are made by selecting substitutions that are less conservative than those shown in Table I. For example, substitutions may be made that more significantly affect the structure of the polypeptide backbone in the area of the alteration, for example the alpha-helical or beta-sheet structure; the charge or hydrophobicity of the molecule at the target site; or the bulk of the side chain. In general, the substitutions expected to produce the greatest changes in the polypeptide's properties are those in which (a) a hydrophilic residue, e.g., seryl or threonyl, is substituted for (or by) a hydrophobic residue (e.g., leucyl, isoleucyl, phenylalanyl, valyl or alanyl); (b) a cysteine or proline is substituted for (or by) any other residue; (c) a residue having an electropositive side chain (e.g., lysyl, arginyl, or histidyl) is substituted for (or by) an electronegative residue (e.g., glutamyl or aspartyl); or (d) a residue having a bulky side chain (e.g., phenylalanine) is substituted for (or by) one not having a side chain (i.e., glycine).
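For illustration, Table I can be held as a simple lookup so that a proposed substitution is classified as exemplary, less favored, or not listed. The Python sketch below transcribes the table as given above; the names `TABLE_I` and `classify_substitution` are hypothetical, and "not listed" means only that the pair does not appear in Table I.

```python
# Table I as (exemplary substitutions, less favored substitutions);
# the less favored entries are those given in parentheses in the table.
TABLE_I = {
    "Ala": (("Ser",), ()),
    "Arg": (("Lys",), ()),
    "Asn": (("Gln",), ("His",)),
    "Asp": (("Glu",), ()),
    "Cys": (("Ser",), ()),
    "Gln": (("Asn",), ()),
    "Glu": (("Asp",), ()),
    "Gly": (("Pro",), ()),
    "His": (("Tyr",), ("Asn", "Gln")),
    "Ile": (("Leu", "Val"), ()),
    "Leu": (("Ile", "Val"), ()),
    "Lys": (("Arg",), ("Gln", "Glu")),
    "Met": (("Leu", "Ile"), ()),
    "Phe": (("Tyr", "Trp"), ("Met", "Leu")),
    "Ser": (("Thr",), ()),
    "Thr": (("Ser",), ()),
    "Trp": (("Tyr", "Phe"), ()),
    "Tyr": (("Trp", "Phe"), ()),
    "Val": (("Ile", "Leu"), ()),
}


def classify_substitution(original, replacement):
    """Classify a substitution against Table I (three-letter codes)."""
    exemplary, less_favored = TABLE_I.get(original, ((), ()))
    if replacement in exemplary:
        return "exemplary"
    if replacement in less_favored:
        return "less favored"
    return "not listed"


print(classify_substitution("Phe", "Tyr"))
print(classify_substitution("His", "Asn"))
print(classify_substitution("Gly", "Trp"))
```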

As outlined above, the variants typically exhibit the same qualitative biological activity (i.e., fluorescence) although variants also are selected to modify the characteristics of the rGFP or pGFP protein as needed.

In a preferred embodiment, specific residues of rGFP or pGFP protein are substituted, resulting in proteins with modified characteristics. Such substitutions may occur at one or more residues, with 1–10 substitutions being preferred. Preferred characteristics to be modified include range of spectral emission, including shifts in excitation spectrum, emission spectrum, rate of folding, stability, solubility, expression levels, toxicity, sensitivity to halide ions, and emission intensity. As is known in the art, there are a number of aGFP variants with desirable properties, and these may be varied in the corresponding rGFP and pGFP amino acid residues.

In a preferred embodiment, residue 46 of rmGFP, pGFP, and residue 43 ofrrGFP (corresponding to residue 43 of aGFP) is substituted with a Thr orAla.

In a preferred embodiment, residue 68 of rm GFP, pGFP and residue 65 ofrrGFP (corresponding to residue 64 of aGFP) is substituted with a Leu orVal.

In a preferred embodiment, residue 69 of rmGFP, pGFP, and residue 66 ofrrGFP (corresponding to residue 65 of aGFP) is substituted with a Thr,Ile, Cys, Ser, Leu, Ala or Gly.

In a preferred embodiment, residue 70 rmGFP, pGFP, and residue 67 ofrrGFP (corresponding to residue 66 of aGFP) is substituted with a His,Phe, or Trp.

In a preferred embodiment, residue 72 of rmGFP, pGFP, and residue 69 ofrrGFP (corresponding to residue 68 of aGFP) is substituted with a Val orLeu.

In a preferred embodiment, residue 76 of rmGFP, pGFP, and residue 73 ofrrGFP (corresponding to residue 72 of aGFP) is substituted with a Ser orAla.

In a preferred embodiment, residue 101 of rmGFP, pGFP, and residue 98 ofrrGFP (corresponding to residue 99 of aGFP) is substituted with a Phe orSer.

In a preferred embodiment, residue 125 of rmGFP and pGFP, and residue124 of rrGFP (corresponding to residue 123 of aGFP) is substituted withan Ile.

In a preferred embodiment, residue 147 rmGFP and pGFP, and residue 146of rrGFP (corresponding to residue 145 of aGFP) is substituted with aTyr, Phe or His.

In a preferred embodiment, residue 148 of rGFP and pGFP, and residue 147of rrGFP (corresponding to residue 146 of aGFP) is substituted with anAsn or Ile.

In a preferred embodiment, residue 150 of rmGFP and pGFP, and residue149 of rrGFP (corresponding to residue 148 of aGFP) is substituted witha His or Arg.

In a preferred embodiment, residue 155 of rmGFP and pGFP, and residue 154 of rrGFP (corresponding to residue 153 of aGFP) is substituted with a Thr or Ala.

In a preferred embodiment, residue 162 of rmGFP and pGFP, and residue 161 of rrGFP (corresponding to residue 163 of aGFP) is substituted with a Val or Ala.

In a preferred embodiment, residue 166 of rmGFP and pGFP, and residue 165 of rrGFP (corresponding to residue 167 of aGFP) is substituted with an Ile or Thr.

In a preferred embodiment, residue 200 of rmGFP and pGFP, and residue 199 of rrGFP (corresponding to residue 202 of aGFP) is substituted with a Ser or Phe.

In a preferred embodiment, residue 201 of rmGFP and pGFP, and residue 200 of rrGFP (corresponding to residue 203 of aGFP) is substituted with an Ile, Thr, or Tyr.

In a preferred embodiment, residue 203 of rmGFP and pGFP, and residue 202 of rrGFP (corresponding to residue 205 of aGFP) is substituted with a Ser or Thr.

In a preferred embodiment, residue 210 of rmGFP and pGFP, and residue 209 of rrGFP (corresponding to residue 212 of aGFP) is substituted with an Asn or Val.

In a preferred embodiment, residue 218 of rmGFP and pGFP, and residue 216 of rrGFP (corresponding to residue 222 of aGFP) is substituted with a Gly or Ser.
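
For convenience, the residue correspondences recited in the preceding embodiments can be collected into a single lookup table. The following Python sketch is illustrative only. The residue numbers and substitutions are transcribed as recited above; the names SUBSTITUTION_SITES and site_for_agfp_residue are hypothetical and form no part of the disclosure.

```python
# Each entry: (rmGFP/pGFP residue, rrGFP residue, aGFP residue, substitutions),
# transcribed from the embodiments recited above.
SUBSTITUTION_SITES = [
    (46, 43, 43, ("Thr", "Ala")),
    (68, 65, 64, ("Leu", "Val")),
    (69, 66, 65, ("Thr", "Ile", "Cys", "Ser", "Leu", "Ala", "Gly")),
    (70, 67, 66, ("His", "Phe", "Trp")),
    (72, 69, 68, ("Val", "Leu")),
    (76, 73, 72, ("Ser", "Ala")),
    (101, 98, 99, ("Phe", "Ser")),
    (125, 124, 123, ("Ile",)),
    (147, 146, 145, ("Tyr", "Phe", "His")),
    (148, 147, 146, ("Asn", "Ile")),
    (150, 149, 148, ("His", "Arg")),
    (155, 154, 153, ("Thr", "Ala")),
    (162, 161, 163, ("Val", "Ala")),
    (166, 165, 167, ("Ile", "Thr")),
    (200, 199, 202, ("Ser", "Phe")),
    (201, 200, 203, ("Ile", "Thr", "Tyr")),
    (203, 202, 205, ("Ser", "Thr")),
    (210, 209, 212, ("Asn", "Val")),
    (218, 216, 222, ("Gly", "Ser")),
]


def site_for_agfp_residue(agfp_residue):
    """Return the recited entry for a given aGFP residue number, or None."""
    for site in SUBSTITUTION_SITES:
        if site[2] == agfp_residue:
            return site
    return None
```

Under these assumptions, looking up aGFP residue 65 returns the entry for residue 69 of rmGFP and pGFP and residue 66 of rrGFP.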

In addition, rGFP or pGFP proteins can be made that are longer than the wild-type, for example, by the addition of epitope or purification tags, the addition of other fusion sequences, etc., as is more fully outlined below.

In another preferred embodiment, GFP variants as used herein include GFPs containing codons replaced with degenerate codons coding for the same amino acid. This arises from the degeneracy of the genetic code, where the same amino acids are encoded by alternative codons. Replacing one codon with another degenerate codon changes the nucleotide sequence without changing the amino acid residue. An extremely large number of nucleic acids may be made, all of which encode the GFPs of the present invention. Thus, having identified a particular amino acid sequence, those skilled in the art could make any number of different nucleic acids by simply modifying the sequence of one or more codons in a way which does not change the amino acid sequence of the protein. In this regard, the present invention has specifically contemplated each and every possible variation of polynucleotides that could be made by selecting combinations based on the possible codon choices, and all such variations are to be considered specifically disclosed and equivalent to the sequences of FIG. 1. It also should be noted that codon optimization that results in one or a small number of amino acid changes, particularly conservative changes, is also possible.
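
The scale of this degeneracy can be illustrated with a short calculation. The following Python sketch, offered as an illustration only, assumes the standard genetic code and counts or enumerates the nucleotide sequences encoding a given amino acid sequence; the function names count_encodings and enumerate_encodings are hypothetical.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group synonymous codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def count_encodings(protein):
    """Number of distinct coding sequences for a one-letter protein string."""
    return prod(len(SYNONYMS[aa]) for aa in protein)


def enumerate_encodings(protein):
    """Yield every coding sequence (practical only for short peptides)."""
    for codons in product(*(SYNONYMS[aa] for aa in protein)):
        yield "".join(codons)
```

For example, the dipeptide Met-Phe has two encodings (ATGTTT and ATGTTC), whereas the tripeptide Leu-Ser-Arg already has 6 x 6 x 6 = 216, so the count grows very rapidly with protein length.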

Changing the codons may be desirable in a variety of situations. For example, substitution with a degenerate codon is useful for eliminating cryptic splice signals present in the coding regions of a gene, inserting restriction sites in the gene, distinguishing one version of the same gene from another (e.g., by hybridization), creating alternative primers for amplification reactions, examining mutational bias in genes, changing chromosomal methylation patterns (e.g., for determining preferential parental transmission), and changing the expression levels of the gene of interest.

Accordingly, in a further preferred embodiment, the GFP variants are codon optimized for expression in a particular organism. By “codon optimized” herein is meant changes in the codons of the gene of interest to those preferentially used in a particular organism such that the gene is efficiently expressed in the organism. Although the genetic code is degenerate in that most amino acids are represented by several codons, called synonyms or synonymous codons, it is well known that codon usage by particular organisms is nonrandom and biased towards particular codon triplets. This codon usage bias may be higher in reference to a given gene, genes of common function or ancestral origin, highly expressed proteins versus low copy number proteins, and the aggregate protein coding regions of an organism's genome. Although codon bias may arise from nucleotide composition or mutational biases in different organisms, codon usage bias in bacteria and yeast correlates with the abundance of tRNA species in the cell. Codon bias is often associated with the level of gene expression. That is, certain codons are preferentially represented in the protein coding regions of highly expressed gene products. Thus, changing the codons to the preferred codons of a particular organism may allow higher level expression of the encoded protein in that organism. In this regard, the present invention relates to GFP variants whose codons are altered to the preferred codons of the organism in which the gene of interest is being expressed. In other words, codons are preferably selected to fit the host cell in which the protein is being produced. For example, preferred codons used in bacteria are used to express the gene in bacteria; preferred codons used in yeast are used for expression in yeast; and preferred codons used in mammalian cells are used for expression in mammalian cells.

By “preferred”, “optimal” or “favored” codons, or “high codon usage bias” or grammatical equivalents as used herein is meant codons used at higher frequency in the protein coding regions than other codons that code for the same amino acid. The preferred codons may be determined in relation to codon usage in a single gene, a set of genes of common function or origin, highly expressed genes, the codon frequency in the aggregate protein coding regions of the whole organism, codon frequency in the aggregate protein coding regions of related organisms, or combinations thereof.

In a preferred embodiment, preferred or favored codons are determined for genes of common function, while in a more preferred embodiment, preferred codons are determined for protein coding regions of the whole organism or related organisms. In a most preferred embodiment, codon usage in a representative number of highly expressed gene products of an organism or related organisms will provide the basis for determining the set of preferred codons. Thus, in one aspect, preferred codons are those codons whose frequency increases with the level of gene expression. Since gene expression may be restricted to specific cells or certain developmental time periods (e.g., embryonic and adult), whether a gene is highly expressed is measured with respect to the cells and the temporal periods when the gene is expressed.

In another aspect, preferred codons are further delineated with respect to the size of the protein coding regions examined. Studies of codon bias show a negative correlation between the size of the protein and codon usage bias (see Duret, L. et al. (1999) Proc. Natl. Acad. Sci. USA 96: 4482–87). For proteins of increasing length, there is a tendency for less codon usage bias, while highly expressed proteins of decreasing length display increased codon usage bias. Thus, in a preferred embodiment, the size of proteins used for assessing preferred codons includes proteins of all lengths, while a more preferred embodiment uses protein lengths up to about 550 amino acids. In the most preferred embodiment, protein lengths of up to about 335 amino acids are used.

A variety of methods are known for determining the codon frequency (e.g., codon usage, relative synonymous codon usage) and codon preference in specific organisms, including multivariate analysis, for example, using cluster analysis or correspondence analysis, and the effective number of codons used in a gene (see GCG CodonPreference, Genetics Computer Group Wisconsin Package; CodonW, John Peden, University of Nottingham; McInerney, J. O. (1998) Bioinformatics 14: 372–73; Stenico, M. et al. (1994) Nucleic Acids Res. 22: 2437–46; Wright, F. (1990) Gene 87: 23–29). Codon usage tables are available for a growing list of organisms (see for example, Wada, K. et al. (1992) Nucleic Acids Res. 20: 2111–2118; Nakamura, Y. et al. (2000) Nucleic Acids Res. 28: 292; Duret, et al. supra). The data source for obtaining codon usage may rely on any available nucleotide sequence capable of coding for a protein. These data sets include nucleic acid sequences actually known to encode expressed proteins (e.g., complete protein coding sequences, CDS), expressed sequence tags (ESTs), or predicted coding regions of genomic sequences (see for example, Mount, D. Bioinformatics: Sequence and Genome Analysis, Chapter 8, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 2001; Uberbacher, E. C. (1996) Methods Enzymol. 266: 259–281; and Tiwari, S. et al. (1997) Comput. Appl. Biosci. 13: 263–270). Accordingly, the present invention relates to codon optimization for enhancing expression of a gene in any host organism.
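
As one illustration of the measures mentioned above, relative synonymous codon usage (RSCU) for a codon is its observed count divided by the mean count of all codons synonymous with it. The following Python sketch is a minimal example under that definition, assuming the standard genetic code and an in-frame coding sequence; it is not the implementation used by any of the cited software packages, and the function name rscu is hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for amino acids that occur in the in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Count sense codons only; stop codons and unknown triplets are skipped.
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")

    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        mean = total / len(family)
        for c in family:
            values[c] = counts[c] / mean
    return values
```

A value above 1 indicates a codon used more often than expected under uniform synonymous usage, and a value below 1 indicates a codon used less often.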

In a preferred embodiment, the nucleotide sequence of rGFP or pGFP is substituted with codons preferentially used in the organism in which the GFP is to be expressed. In identifying the codons for modification or replacement, the codons of rGFP or pGFP (or any other protein coding region) are compared to the codons favored or preferred in the organism of interest. This analysis identifies differences between the preferred set of codons and the codons actually used, and thus identifies nucleotides for substitution. In a preferred embodiment, codons in rGFP or pGFP that are the least preferred codons in the subject organism are selected for substitution. Further substitutions are made for frequently occurring codons in rGFP or pGFP that are not the preferred codons. Although the frequently occurring codons may not comprise the least preferred codons, the presence of numerous non-preferred or nonoptimal codons can limit efficient expression of the protein product.

When several preferred codons are available for the same amino acid, the choice of substitution can rely on other considerations, such as ease of constructing the variant, concerns for limiting introduction of mutations during propagation of the gene in the host organism (i.e., mutational bias), secondary structure of the mRNA that may affect expression levels, and concern for generating splice sites. Other considerations may take into account the intended uses of the codon optimized variants, such as insertion of restriction sites for generating fusion proteins. Thus, some deviations from strict adherence to preferred codons are permissible to accommodate restriction sites in the resulting gene for the purposes of constructing the variant, replacing gene segments (e.g., to simplify insertion of mutated gene segments), and creating fusion proteins, as described below.

In certain embodiments, not all codons need be replaced to optimize the codon usage of the GFP, since the natural sequence will comprise some of the preferred codons and because use of preferred codons may not be required for all amino acid residues. In one aspect, about 10 to about 35% of the codons are replaced or changed. Additional changes may be introduced to maximize expression. Consequently, codon optimized GFP sequences may contain preferred codons at about 40%, 50%, 60%, 70%, 80%, or greater than 90% of codon positions of the full length coding region.
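
The proportion of preferred codons in a coding region, and the positions remaining as candidates for substitution, can be tallied directly. The following Python sketch is an illustration only. It assumes, as a simplification, that the single highest-frequency human codon for each amino acid in Table II is the "preferred" codon, and that the input is an in-frame coding sequence without a stop codon; the names PREFERRED, preferred_fraction, and non_preferred_positions are hypothetical.

```python
# Highest-frequency human codon for each amino acid, taken from Table II.
PREFERRED = {
    "F": "TTC", "L": "CTG", "S": "AGC", "Y": "TAC", "C": "TGC",
    "W": "TGG", "P": "CCC", "H": "CAC", "Q": "CAG", "R": "AGG",
    "I": "ATC", "M": "ATG", "T": "ACC", "N": "AAC", "K": "AAG",
    "V": "GTG", "A": "GCC", "D": "GAC", "E": "GAG", "G": "GGC",
}
PREFERRED_CODONS = set(PREFERRED.values())


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping any partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def preferred_fraction(cds):
    """Fraction of codon positions occupied by a preferred codon."""
    codons = split_codons(cds)
    if not codons:
        return 0.0
    return sum(c in PREFERRED_CODONS for c in codons) / len(codons)


def non_preferred_positions(cds):
    """1-based codon positions that are candidates for substitution."""
    return [i for i, c in enumerate(split_codons(cds), start=1)
            if c not in PREFERRED_CODONS]
```

For the nine-nucleotide sequence ATGCTGTTA, two of the three codons are preferred and the third position (TTA) is reported as a candidate for substitution.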

Preferred genes of interest are codon optimized for prokaryotes or eukaryotes. Prokaryotes may comprise, among others, bacteria, including Bacillus (for example, subtilis, anthracis), Clostridia, Staphylococcus, Streptococcus, Neisseria, Erysipelothrix, Listeria, Nocardia, Salmonella, Shigella, Escherichia, Klebsiella, Enterobacter, Serratia, Proteus, Morganella, Providencia, Yersinia, Haemophilus, Brucella, Francisella, Vibrio, Pseudomonas, Campylobacter, Clostridium, Actinomyces, Corynebacterium, Bacteroides, Mycobacterium (for example, tuberculosis, leprae); spirochetes, including Treponema, Borrelia, Leptospira, and Spirillum; archaebacteria, including Methanobacterium, Thermoplasma, Thermophilus, or other thermophiles (e.g., Sulfolobus), and Halobacterium; and cyanobacteria. Eukaryotes may comprise, among others, protists, including Mastigophora, Sarcodina, Ciliophora, and Sporozoa (trypanosoma); fungi, including Saccharomyces, Schizosaccharomyces, Candida, Neurospora, Aspergillus, Ustilago, Penicillium, and Sordaria; plants, including Chlorophyta and Tracheophyta-Angiosperms and Spermopsida (e.g., tobacco, arabidopsis, corn, rice, wheat, tomato, potato, etc.); worms, including nematoda (e.g., Caenorhabditis, Trichinella, Trichuris), platyhelminthes (e.g., Diphyllobothrium, Clonorchis, and Dugesia (e.g., planaria)); insects, including Drosophila, Manduca, Bombyx, etc.; amphibia (e.g., Xenopus, newts, salamanders, etc.); fish (e.g., salmon, catfish, zebrafish, Xiphophorus, trout, goldfish, tilapia, medaka, etc.); aves (e.g., turkey, chicken, duck, quail, geese, etc.); mammalia, including rodentia (e.g., mice, rats, gerbils, hamsters, etc.), lagomorpha (e.g., rabbits, hares), artiodactyla (e.g., cows, pigs, sheep, goats, etc.), canis (e.g., domestic dog), felis (e.g., domestic cat), and primates (e.g., monkeys, chimpanzees, and humans). Codon optimization for expression in bacteria, yeast, mammalian cells (e.g., rodents, primates, etc.), and in particular human cell types, is most preferred.

Codon preference in the coding regions of human genes is given in Table II. The table shows, for each codon, its relative frequency among synonymous codons. The most preferred codons are given in bold. Methionine and tryptophan have a value of 1 since each of these residues is encoded by a single codon. For certain amino acids, such as arg, four synonymous codons are used at similar frequencies.

TABLE II

TTT  phe  F  0.43    TCT  ser  S  0.18    TAT  tyr  Y  0.42    TGT  cys  C  0.42
TTC  phe  F  0.57    TCC  ser  S  0.23    TAC  tyr  Y  0.58    TGC  cys  C  0.58
TTA  leu  L  0.06    TCA  ser  S  0.15    TAA  och  Z  -       TGA  opa  Z  -
TTG  leu  L  0.12    TCG  ser  S  0.06    TAG  amb  Z  -       TGG  trp  W  1.00
CTT  leu  L  0.12    CCT  pro  P  0.29    CAT  his  H  0.41    CGT  arg  R  0.09
CTC  leu  L  0.20    CCC  pro  P  0.33    CAC  his  H  0.59    CGC  arg  R  0.19
CTA  leu  L  0.07    CCA  pro  P  0.27    CAA  gln  Q  0.27    CGA  arg  R  0.10
CTG  leu  L  0.43    CCG  pro  P  0.11    CAG  gln  Q  0.73    CGG  arg  R  0.19
ATT  ile  I  0.35    ACT  thr  T  0.23    AAT  asn  N  0.44    AGT  ser  S  0.14
ATC  ile  I  0.52    ACC  thr  T  0.38    AAC  asn  N  0.56    AGC  ser  S  0.25
ATA  ile  I  0.14    ACA  thr  T  0.27    AAA  lys  K  0.40    AGA  arg  R  0.21
ATG  met  M  1.00    ACG  thr  T  0.12    AAG  lys  K  0.60    AGG  arg  R  0.22
GTT  val  V  0.17    GCT  ala  A  0.28    GAT  asp  D  0.44    GGT  gly  G  0.18
GTC  val  V  0.25    GCC  ala  A  0.40    GAC  asp  D  0.56    GGC  gly  G  0.33
GTA  val  V  0.10    GCA  ala  A  0.22    GAA  glu  E  0.41    GGA  gly  G  0.26
GTG  val  V  0.48    GCG  ala  A  0.10    GAG  glu  E  0.59    GGG  gly  G  0.23
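
By way of illustration only, the frequencies of Table II can be used to select, for each amino acid, the most frequently used human codon and to back-translate an amino acid sequence accordingly. The following Python sketch transcribes the sense codons of Table II; it is a simplification that ignores restriction sites, splice signals, mRNA structure, and the other considerations discussed above, and the names TABLE_II, parse_table, and back_translate are hypothetical.

```python
# Sense codons of Table II as "codon amino-acid frequency" triples
# (stop codons omitted because they carry no frequency in the table).
TABLE_II = """
TTT F 0.43  TCT S 0.18  TAT Y 0.42  TGT C 0.42
TTC F 0.57  TCC S 0.23  TAC Y 0.58  TGC C 0.58
TTA L 0.06  TCA S 0.15
TTG L 0.12  TCG S 0.06              TGG W 1.00
CTT L 0.12  CCT P 0.29  CAT H 0.41  CGT R 0.09
CTC L 0.20  CCC P 0.33  CAC H 0.59  CGC R 0.19
CTA L 0.07  CCA P 0.27  CAA Q 0.27  CGA R 0.10
CTG L 0.43  CCG P 0.11  CAG Q 0.73  CGG R 0.19
ATT I 0.35  ACT T 0.23  AAT N 0.44  AGT S 0.14
ATC I 0.52  ACC T 0.38  AAC N 0.56  AGC S 0.25
ATA I 0.14  ACA T 0.27  AAA K 0.40  AGA R 0.21
ATG M 1.00  ACG T 0.12  AAG K 0.60  AGG R 0.22
GTT V 0.17  GCT A 0.28  GAT D 0.44  GGT G 0.18
GTC V 0.25  GCC A 0.40  GAC D 0.56  GGC G 0.33
GTA V 0.10  GCA A 0.22  GAA E 0.41  GGA G 0.26
GTG V 0.48  GCG A 0.10  GAG E 0.59  GGG G 0.23
"""


def parse_table(text):
    """Parse whitespace-separated triples into {amino_acid: {codon: frequency}}."""
    tokens = text.split()
    freqs = {}
    for codon, aa, freq in zip(tokens[0::3], tokens[1::3], tokens[2::3]):
        freqs.setdefault(aa, {})[codon] = float(freq)
    return freqs


FREQS = parse_table(TABLE_II)
# Most frequently used codon for each amino acid.
PREFERRED = {aa: max(codons, key=codons.get) for aa, codons in FREQS.items()}


def back_translate(protein):
    """Encode a one-letter protein string using only the preferred codons."""
    return "".join(PREFERRED[aa] for aa in protein.upper())
```

Under these assumptions, leucine is encoded by CTG, serine by AGC, and arginine by AGG, although for arginine the margin over AGA is small (0.22 versus 0.21).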

The codon optimized GFPs are made in accordance with methods well known in the art. When the substitutions or replacements are not extensive, oligonucleotide directed mutagenesis or other localized mutagenesis techniques, such as replacing fragments of the gene with fragments containing the preferred codons, are used to optimize the codons. If codon optimization is extensive, the GFP gene may be a synthetic gene generated from overlapping oligonucleotides (Jayaraman, K. et al. (1991) Proc. Natl. Acad. Sci. USA 88: 4084–8; Stemmer, W. P. et al. (1995) Gene 164: 49–53). The oligonucleotides may or may not be ligated together during the process for generating the synthetic gene. In this regard, use of polymerase chain reaction of the hybridized overlapping oligonucleotides allows facile generation of these synthetic genes.
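
The layout of overlapping oligonucleotides for such a synthetic gene can be sketched as follows. This Python example is a simplified illustration and is not the procedure of the cited references: it tiles a sequence with fixed-length oligonucleotides on alternating strands, and the oligonucleotide length (40) and overlap (20) are arbitrary assumed values. Practical designs would also balance melting temperatures and avoid secondary structure. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def design_oligos(gene, oligo_len=40, overlap=20):
    """Tile the gene with overlapping oligos on alternating strands."""
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and shorter than oligo_len")
    if len(gene) <= overlap:
        return [gene]
    step = oligo_len - overlap
    oligos = []
    for i, start in enumerate(range(0, len(gene) - overlap, step)):
        fragment = gene[start:start + oligo_len]
        # Even-numbered oligos are sense strand; odd-numbered are antisense.
        oligos.append(fragment if i % 2 == 0 else reverse_complement(fragment))
    return oligos


def assemble(oligos, overlap=20):
    """Rebuild the sense strand from the tiled oligos (a check on the design)."""
    sense = [o if i % 2 == 0 else reverse_complement(o)
             for i, o in enumerate(oligos)]
    return sense[0] + "".join(f[overlap:] for f in sense[1:])
```

Reassembling the designed oligonucleotides in silico and comparing the result with the intended sequence provides a simple consistency check before synthesis.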

In accordance with the present invention, exemplary codon optimized variants for expression in human cells are provided by SEQ ID NO: 1 for Renilla muelleri GFP and SEQ ID NO: 2 for Ptilosarcus gurneyi GFP (FIGS. 2 and 3, respectively). In the codon optimized rmGFP, the codons for 9 of the 239 amino acids are not the preferred human codons, in order to accommodate restriction sites used for constructing various rmGFP fusion proteins. For the codon optimized pGFP, the codons for 11 of the 239 amino acids are not the preferred codons, for the same reasons. It will be understood, however, that the codon optimized sequences of the present invention are by no means limited to the representative sequences provided herein. In view of the preceding discussion, one of skill in the art will readily be able to prepare a number of different codon optimized GFP sequences for expression in a given organism, especially for expression in human cells, or other cells as outlined herein.

In a preferred embodiment, the rGFP or pGFP protein, including variants, is fused to a protein of interest, including peptides as outlined herein. By “fused” or “operably linked” herein is meant that the peptide, as defined below, and the rGFP or pGFP protein are linked together. In a preferred embodiment, fusion nucleic acids are made such that fusion polypeptides, e.g., a single polypeptide, are made. In an alternative embodiment, fusion nucleic acids comprising separation sites (e.g., protease recognition sequences, 2A sequences, or IRES sequences) are made as further described below. In one preferred embodiment, the fusion disrupts the fluorescence characteristics of the rGFP or pGFP. That is, the fluorescence characteristics of the rGFP or pGFP are changed, including under different solution conditions (e.g., temperature, pH, ion concentration, halide concentration, membrane potential, etc.). In another preferred embodiment, the fusion only minimally disrupts the stability of rGFP or pGFP. That is, the rGFP or pGFP preferably retains its fluorescence, or maintains a Tm (thermal melting temperature) of at least 42° C.

In a preferred embodiment, the present invention is also useful in marking viruses and cells and as reporters for cell proliferation, as further illustrated below. General expression or specific regulated expression of the fusion proteins marks the cell, either constitutively or at specific periods in development. These marked viruses and cells may be detected and tracked to determine their migration or proliferation in an organism or in response to specific biological signals, for example cytokines and chemokines. As further described below, these cells may be used in screens to identify candidate agents that alter the infectivity, migration or proliferation of these viruses or cells in response to the biological signals.

In a preferred embodiment, the fusions to rGFP or pGFP are used for tracking or localizing the protein to a particular subcellular location; quantitating gene expression; displaying peptides; indicating cellular reactions; providing markers for cell growth and proliferation, etc. The fusions may be made to any protein of interest encoded by any gene of interest. These include genomic DNA, cDNA, protein-interaction domains, targeting sequences (e.g., localization sequences), stability sequences, protein-modification sequences (e.g., phosphorylation, ADP ribosylation, lipidation, glycosylation, protease sites, etc.), random peptides, and biosensor sequences, as further discussed below. The fusions may be made to the amino terminal, the carboxy terminal, or internally to the GFP sequence. When the fusions are internal to the rGFP or pGFP, they are preferably in the internal loops of the fluorescent proteins. In a preferred embodiment, the fusions do not affect the fluorescence, which allows direct detection of the fusion protein. In another aspect, detecting the fusion protein uses a label that binds the fusion protein, such as a labeled antibody directed against rGFP or pGFP or the fused gene of interest, in which case the fusion protein need not be fluorescent. As outlined below, the fusion polypeptide (or fusion polynucleotide encoding the fusion polypeptide) can comprise additional components, including multiple peptides at multiple loops, fusion partners, linkers, etc.

In a preferred embodiment, the fusion to rGFP or pGFP, preferably a codon optimized variant, is used to track and localize proteins intracellularly or extracellularly. Fusions may be made to any protein of interest to examine cellular processing events of the subject protein. Proteins of interest include cytoskeletal proteins for tracking cell movement and cell structure; focal adhesion proteins involved in cell adherence; nuclear proteins for examining signals involved in nuclear transport; nuclear membrane proteins involved in nuclear membrane dissolution and reformation; cell organelle replication and structure; intracellular transport of proteins (e.g., targeting signals); development of structural polarity in cells (e.g., neuronal or epithelial cells); monitoring cell division processes; and the like. Many of these aforementioned processes are abnormal in disease cells, such as cancer cells. These fusion proteins expressed in cells are useful for identifying candidate agents that affect these biological processes in particular cell types. Thus, screens may be conducted for agents that confer a phenotype similar to a disease cell or for agents that convert an abnormal cell, characterized by an abnormal cellular process, to a normal cell.

In another preferred embodiment, the fusions are made to protein-modification sequences. These may be any sequence capable of being modified by any modification process. In a preferred embodiment, the modification sequence is modified by another protein or enzyme. In one aspect, the modification sequence comprises a phosphorylation sequence (Yang, F. et al. (1996) Anal. Biochem. 266: 167–73). A variety of phosphorylation sequences are known (e.g., src homology domain SH2 and SH3) and recognized by kinases that attach phosphates to specific amino acids (e.g., serine, threonine, tyrosine, histidine) (see Kreegipuu, A. et al. (1999) Nucleic Acids Res. 27: 237–39). The phosphorylation sequences are fused to GFP to allow correct presentation to the cognate kinases. Phosphorylation of the sequence may or may not affect the fluorescence properties of rGFP or pGFP. By “fluorescence properties” herein is meant any detectable change in the fluorescence characteristic of the GFP. This may involve the molar extinction coefficient at the appropriate excitation wavelength, fluorescence quantum yield, excitation and emission spectra, ratio of excitation amplitudes at two different wavelengths, ratio of emission amplitudes at two different wavelengths, excitation lifetime, and fluorescence quenching.

In one preferred embodiment, phosphorylation of the fusion proteins does not affect the fluorescence characteristic. In this context, the GFP provides a scaffold for efficient presentation of the sequence as a substrate for a kinase. The phosphorylation is detected by direct labeling with labeled nucleotide substrate (e.g., ATP) or reaction with antibodies specific for phosphorylated sequences. In another preferred embodiment, phosphorylation of the fusion protein changes the fluorescence characteristics of the fluorescent protein such that the change provides an indication of kinase activity (see U.S. Pat. No. 6,248,550, expressly incorporated by reference). Generally, the kinase substrate fusion protein displays distinguishable properties between the phosphorylated and unphosphorylated states. Measuring the change in fluorescent characteristic before and after contacting with the kinase provides a measure of kinase activity.

In another preferred embodiment, translocation from one cellular location to another, or the ability to interact with a phosphoprotein binding domain, provides another measure of phosphorylation. It is well known that phosphorylation of specific sequences alters the interaction of the sequence with a cognate binding partner. Phosphorylation may prevent or enhance these interactions. Thus, phosphorylation is detectable by examining affinity of the binding partners to the fusion protein or by examining changes in intracellular location of the rGFP or pGFP fusion polypeptide (see for example, Durocher, D. et al. (2000) Mol. Cell 6: 1169–82; Yaffe, M. B. et al. (2001) Structure 9: R33–8).

In another preferred embodiment, the phosphorylation substrates are candidate substrates comprising a library of random peptides, a library of cDNA fragments, or a library of genomic nucleic acid fragments fused to rGFP or pGFP, as discussed below. In one aspect, the library of candidate substrates is expressed in host cells, each of which expresses a different candidate substrate. A kinase is contacted with the fusion proteins, for example by transfecting the cells with a vector expressing the kinase or by subjecting the cells to a condition that induces kinase activity. Peptides affecting the GFP fluorescence properties or localization of the rGFP or pGFP fusion protein substrate following treatment with kinase are identified. Sequences producing detectable changes are isolated and sequenced to determine the putative kinase sequences.

The general approach outlined above is applicable to a variety of other protein modification reactions. For example, adenosine diphosphate (ADP)-ribosyltransferases bind nicotinamide adenine dinucleotide (NAD) and catalyze the transfer of the ADP-ribose moiety to an acceptor nucleophile, with cleavage of the glycosidic bond between N-1 of the nicotinamide and C-1 of the adjacent ribose. The modification may comprise a mono-ADP ribosylation or poly-ADP ribosylation, depending on the transferase enzyme (Koch-Nolte, F. (2001) J. Biotechnol. 92: 81–87). Bacterial toxins, such as pertussis toxin and cholera toxin, act by ADP ribosylating heterotrimeric GTP binding proteins that control intracellular signaling and vesicle trafficking. Poly-ADP ribosylation appears to play roles in DNA damage recovery, DNA replication, and viral integration. Thus, the present invention provides for fusion proteins comprising rGFP or pGFP and ADP ribosylation sites made in the same manner as that provided for phosphorylation sites. Mono and poly ADP ribosylated sequences include, among others, those present on heterotrimeric G proteins (Yamamoto, M. (1993) Oncogene 8: 1449–55; Finck-Barbancon, V. (1995) Biochemistry 34: 1070–75; and von Olleschik-Elbheim, L. (1997) Adv Exp Med Biol 419: 87–91), muscle protein desmin (Zhou, H., et al. (1996) Arch. Biochem. Biophys. 334: 214–222), poly-ADP ribosylase (Martinez, M. (1991) Biochem Biophys Res Commun. 181: 1412–8), and phosphorylase kinase (Okazaki, I. J. (1996) Adv. Pharmacol. 35: 247–80).

In another preferred embodiment, the fusion proteins comprise rGFP or pGFP fused to protease recognition sequences for detecting protease activity. Biological functions of proteases are well known in the art, including, but not limited to, pathogenesis (e.g., polyprotein processing by HIV protease), cell death (e.g., caspases), cell adhesion (e.g., metalloproteases), and the like. In one aspect, protease recognition sequences, as further described below, are fused to rGFP or pGFP, or variants thereof. Cleavage of the fusion protein changes the fluorescence characteristic, which provides a measure of protease activity. In one aspect, the cleavage site is inserted into the rGFP or pGFP. That is, the protease recognition sequence is inserted into the internal regions of GFP, preferably the surface loops.

In another preferred embodiment, the protease substrates may comprise fusion of rGFP or pGFP to rGFP or pGFP variants or other fluorescent proteins such that fluorescence resonance energy transfer (FRET) is possible between the two linked fluorescent molecules. Generally, fluorescence resonance energy transfer occurs between two dye molecules in which excitation is transferred from a donor molecule to an acceptor molecule without emission of a photon. Donor and acceptor molecules must be in close proximity (i.e., radial distance within approximately 10 nm of each other) and have their transition dipole orientations approximately parallel to each other. For excitation transfer from donor to acceptor to occur, the absorption spectrum of the acceptor must overlap the fluorescence emission spectrum of the donor. Suitable pairs of fluorescent molecules capable of undergoing FRET may include rGFP or pGFP with BFP (blue fluorescent protein, Heim, R. et al. (1996) Curr. Biol. 6: 178–82), rGFP or pGFP with BFP5 (Mitra, R. D. (1996) Gene 173: 13–17), rGFP or pGFP with cyan fluorescent protein (CFP), rGFP or pGFP with Anemonia majano fluorescent protein amFP486 (Matz, M. V. (1999) Nat. Biotechnology 17: 969–73), rGFP or pGFP with Discosoma striata dsFP483 (Matz, supra), rGFP or pGFP with Clavularia cFP484 (Matz, supra), and the like. In these donor-acceptor pairs, the rGFP or pGFP functions as the acceptor. In principle, other donor-acceptor pairs are possible in which rGFP or pGFP serves as the donor to acceptor fluorescent protein variants having excitation and emission peaks about 20 nm or more above those of rGFP or pGFP. Examples of suitable acceptors include yellow fluorescent protein (i.e., class 4 GFPs, see Tsien, R. (1998) Ann. Rev. Biochem. 67: 509–44), Zoanthus zFP538 (Matz, supra), Discosoma drFP583 (Matz, supra), and the like. The protease recognition site is incorporated as part of the linker sequence connecting the donor and acceptor GFPs. Cleavage of the linker by proteases results in physical separation of the two fluorescent proteins, thus resulting in loss of FRET. A variety of protease and protease recognition sequence combinations may be used, as further described below. The reactions may occur in vitro by contacting the protease with a FRET protease substrate. In another aspect, the reactions are done in vivo by expressing the protease substrates in the cell and introducing vectors expressing the protease or inducing the protease activity by appropriate treatment of the cells. In one preferred embodiment, the FRET protease substrate and/or protease are introduced into the cell by retroviral vectors.
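
The steep distance dependence that makes linker cleavage detectable can be illustrated with the standard Förster relation, in which transfer efficiency is E = 1 / (1 + (r/R0)^6) for donor-acceptor distance r and Förster radius R0. The following Python sketch is an illustration only; the default Förster radius of 5 nm is an assumed placeholder and is not a measured value for any rGFP or pGFP pair, and the function name fret_efficiency is hypothetical.

```python
def fret_efficiency(distance_nm, forster_radius_nm=5.0):
    """Forster transfer efficiency E = 1 / (1 + (r / R0)**6).

    forster_radius_nm is the distance at which E = 0.5; the default of
    5 nm is an assumed placeholder, not a value for any specific pair.
    """
    if distance_nm < 0 or forster_radius_nm <= 0:
        raise ValueError("distance must be >= 0 and Forster radius > 0")
    return 1.0 / (1.0 + (distance_nm / forster_radius_nm) ** 6)
```

With these assumed values, efficiency is one half at 5 nm, above 0.98 at 2.5 nm, and below 0.02 at 10 nm, consistent with the requirement that donor and acceptor lie within approximately 10 nm of each other; separation of the two proteins after cleavage therefore effectively abolishes the signal.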

Since FRET based reactions provide a basis for monitoring various biological processes, FRET using rGFP or pGFP or their variants as either the donor or acceptor is also applicable for examining various biological reactions. In one preferred embodiment, the FRET molecule comprising rGFP or pGFP, acting as either a donor or acceptor molecule, further comprises a sequence capable of binding an analyte or ligand which causes a change in the spatial orientation of the donor fluorescent protein and the acceptor fluorescent protein relative to one another (see U.S. Pat. No. 6,197,928, hereby expressly incorporated by reference). In one preferred embodiment, the ligand binding region is fused to the two fluorescent proteins without linkers. In another preferred embodiment, the ligand binding region is fused to the fluorescent proteins by linkers to provide proper spatial orientation between the donor and acceptor fluorescent proteins for FRET to occur and to permit binding of ligand to the binding sequence.

Various binding regions may be used with the present invention. These include calcium binding regions (Romoser, V. A. (1997) J. Biol. Chem. 272: 13270–74), protein interaction domains (e.g., phosphoprotein binding domain), receptors (e.g., Fas), and the like (see U.S. Pat. No. 6,197,928). Linkers may comprise glycines or serines or combinations thereof to prevent structural perturbations between the GFPs (e.g., to cause proper folding of the proteins) and the binding domains. Linker sequences are appropriately positioned to either cause an increase or a decrease in FRET upon binding of ligand to the binding sequence. In one aspect, various mutant forms of the binding domain may be made to maximize the range of ligand concentrations capable of being detected in vivo or in vitro by FRET. Fusing these fusion proteins to targeting sequences allows measuring the concentration of the analytes within particular subcellular compartments. In a preferred embodiment, the GFPs used for FRET and their corresponding binding regions and linker sequences are codon optimized to maximize expression within particular cells, especially mammalian cells. Codon optimization is employed because non-optimized forms may not produce sufficient changes in FRET signal to act as a FRET reporter molecule.

In another preferred embodiment, the FRET based reactions do not use a sequence that physically links the donor and acceptor pairs. That is, the donor and acceptor fluorescent fusion proteins exist separately. In this preferred embodiment, rGFP or pGFP fusions may be made to protein-interaction domains. Thus, a first fusion protein comprises a first protein interaction domain fused to rGFP or pGFP, or their variants. A second fusion protein comprises a second protein interaction domain, which is capable of interacting with the first protein interaction domain, fused to a fluorescent protein capable of undergoing FRET with rGFP or pGFP. Juxtaposition of the two fluorescent proteins through the protein interaction regions results in a FRET signal. In general, fused fluorescent proteins separated by a linker provide a positive control for a detectable FRET signal. Conversely, expression of each fluorescent protein fused to its cognate protein interaction domain provides a negative control for determining background signal and the relative signal intensities of the two fluorescent proteins. Cells expressing the fusion proteins may be examined in vivo, in vitro, or after fixation in a chemical fixative (e.g., formaldehyde, paraformaldehyde, glutaraldehyde). Generally, measuring the FRET ratio provides one basis for determining interaction between the two protein interaction domains (Miyawaki, A. et al. (2000) Methods Enzymol. 327: 472–500). As further described below, the protein interaction domain comprises any sequence capable of interacting with other molecules, including other proteins, nucleic acids, lipids, carbohydrates, and the like. The interaction domains may be identical, in which case homomultimeric interactions may be examined, while in other cases, the interaction domains are different, in which case heteromultimeric interactions may be examined (Guo, C. et al. (1995) J. Biol. Chem. 270: 27562–68; Mahajan, N. P. (1998) Nat. Biotechnol. 16: 547–52; Ng, E. K. (2002) J. Cell Biochem. 84: 556–66; and Day, R. N. (2001) Methods 25: 4–18).
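
By way of illustration only, the FRET ratio measurement referred to above may be sketched as follows. The sketch assumes one common convention, namely that the ratio is the background-corrected acceptor emission divided by the background-corrected donor emission, both measured under donor excitation; the function names, parameters, and the comparison against a negative control are hypothetical and are not prescribed by this specification.

```python
def fret_ratio(acceptor_emission, donor_emission,
               acceptor_background=0.0, donor_background=0.0):
    """Return a background-corrected acceptor/donor emission ratio.

    Assumption: both intensities are recorded under donor excitation,
    and the backgrounds come from cells expressing no fluorescent protein.
    """
    donor = donor_emission - donor_background
    acceptor = acceptor_emission - acceptor_background
    if donor <= 0:
        raise ValueError("corrected donor emission must be positive")
    return acceptor / donor


def fold_over_control(sample_ratio, negative_control_ratio):
    """Express a sample FRET ratio relative to a negative-control ratio."""
    if negative_control_ratio <= 0:
        raise ValueError("negative-control ratio must be positive")
    return sample_ratio / negative_control_ratio


if __name__ == "__main__":
    sample = fret_ratio(300.0, 150.0, acceptor_background=100.0,
                        donor_background=50.0)
    print(sample, fold_over_control(sample, 0.5))
```

A ratio that rises above that of the negative control, toward that of the linked positive control, would be consistent with interaction of the two protein interaction domains under these assumptions.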

Since fluorescent proteins serve as useful reporters of cellular events, the present invention further relates to fusion proteins comprising rGFP or pGFP fused to various protein interaction domains whose interactions change depending on the physiological state of the cell. These fusion proteins serve as biosensors, as defined below. Protein interaction domains whose interactions with binding partners change with different cellular states are well known in the art. As illustrated below, pleckstrin domains bind specifically to PtdInsP₂, which is released from the membrane by action of phospholipases activated by signal transduction events. Phosphoprotein binding domains (e.g., SH2 domains) interact with specific phosphorylated peptide sequences as part of their mechanism of signal transduction. The voltage sensing domain of voltage sensitive ion channels (e.g., Shaker potassium channels) shifts within the membrane depending on the membrane potential, thus altering the solution environment of sequences adjacent to the voltage sensor (Siegel, M. S. (1997) Neuron 19: 735–41). In the present invention, rGFP, pGFP or variants thereof are fused to these sequences to generate fusion proteins whose cellular localization or fluorescence properties change depending on the physiological state of the cell. Determining changes in cellular localization may be done by fluorescence microscopy, while changes in fluorescence may be examined by measuring the fluorescence characteristics at two different cellular states.

In another preferred embodiment, the fusion polypeptides comprise rGFP or pGFP fused to peptides or proteins encoded by cDNA or cDNA fragments. As used herein, by cDNA is meant a DNA that is complementary to at least a portion of an RNA, preferably a messenger RNA, and is generally synthesized from an RNA preparation using reverse transcriptase. As further described below, the cDNA may be full length (i.e., complementary to the full length RNA) or a partial cDNA, which is less than the full length RNA. The cDNA may be a cDNA fragment, which is derived from a larger cDNA by methods described below. Methods for constructing cDNA libraries from RNA, especially mRNA, are well known in the art (see Ausubel, F. In Current Protocols in Molecular Biology, John Wiley & Sons, updated October 2001, Chapter 5, Construction of Recombinant DNA Libraries, particularly Section III, Preparation of Insert DNA from Messenger RNA, expressly incorporated by reference herein). In addition, two commonly used methods of producing cDNA are described in Okayama and Berg (1982) Mol. Cell Biol. 2: 161–170 and Gubler and Hoffman (1983) Gene 25: 263–269. In a preferred embodiment, the cDNAs are inserted into the carboxy or the amino terminal region of rGFP or pGFP. In another preferred embodiment, cDNA is inserted into the internal regions of rGFP or pGFP. Preferably, the insertions do not affect the fluorescence of the rGFP or pGFP to allow monitoring of cDNA expression. Fusions to the amino terminal or internal regions of rGFP or pGFP permit identification of cDNAs that are in frame with respect to the GFP protein as indicated by the expression of fluorescent fusion proteins. Preferably, codon optimized rGFP or pGFP variants are used to maximize expression of the fusion polypeptides and to increase the fluorescence signal of expressed fusion nucleic acids.

As provided more fully below, cDNA may be generated from any number of organisms and cell types, including cDNAs generated from eukaryotic and prokaryotic cells, viruses, cells infected with viruses, pathogens, or from genetically altered cells. The cDNA may encode specific domains, such as signaling domains, protein-interaction domains, membrane binding domains, targeting domains, and the like. Furthermore, the cDNA may be frameshifted by adding or deleting nucleotides, which may result in an out of frame construct, such that a pseudorandom peptide or protein is encoded. In addition, the cDNAs and cDNA libraries contemplate various subtracted cDNA or enriched cDNA libraries (e.g., secreted or membrane proteins; see Kopczynski, C. C. (1998) Proc. Natl. Acad. Sci. USA 95: 9973–78). That is, a cDNA library may be a complete cDNA library from a cell, a partial library, an enriched library from one or more cell types, or a constructed library with certain cDNAs being removed to form a library.

In another preferred embodiment, the fusion polypeptides comprise rGFPor pGFP fused to proteins or peptides encoded by genomic DNA. Aselaborated above for cDNA, the genomic DNA can be derived from anynumber of organisms or cells, including genomic DNA of eukaryotic orprokaryotic cells, or viruses. They may be from normal cells or cellsdefective in cellular processes, such as tumor suppression, cell cyclecontrol, or cell surface adhesion. As more fully explained below, thegenomic DNA may be from entire genomic constructs or fractionatedconstructs, including random or targeted fractionation.

In another preferred embodiment, the fusion polypeptides comprise rGFP or pGFP fused to random peptides. Generally, peptides ranging from about 4 amino acids in length to about 100 amino acids may be used, with peptides ranging from about 5 to about 50 being preferred, with from about 8 to about 30 being particularly preferred and from about 10 to about 25 being especially preferred. As more fully explained below, the peptides are fully randomized or they are biased in their randomization. In one preferred embodiment, the random peptide is linked to a fusion partner to structurally constrain the peptide and allow proper interaction with other molecules, while in another preferred embodiment, the expressed random peptide is not linked to a fusion partner. Random peptides expressed as fusions with rGFP, pGFP, or variants thereof may be screened for their ability to produce an altered cellular phenotype.

For the fusion polypeptides of the present invention, the fusions are made in a variety of ways. In one preferred embodiment, the peptide is fused to the N-terminus of the rGFP or pGFP. The fusion can be direct, i.e., with no additional residues between the C-terminus of the peptide and the N-terminus of the rGFP or pGFP, or indirect; that is, intervening amino acids are used, such as one or more fusion partners, including a linker. In this embodiment, when the fusions are to peptides, such as random peptides or protein interaction domains, preferably a presentation structure is used to confer some conformational stability to the peptide. Particularly preferred embodiments include the use of dimerization sequences.

In one embodiment, N-terminal residues of the rGFP or pGFP are deleted, i.e., one or more amino acids of the rGFP or pGFP can be deleted and replaced with the protein or peptide of interest. However, as noted above, deletions of more than 7 amino acids may render the rGFP or pGFP less fluorescent, and thus larger deletions are generally not preferred. In a preferred embodiment, the fusion is made directly to the first amino acid of the rGFP or pGFP.

In a preferred embodiment, the peptide is fused to the C-terminus of the rGFP or pGFP. As above for N-terminal fusions, the fusion can be direct or indirect, and C-terminal residues may be deleted.

In a preferred embodiment, proteins, peptides and fusion partners areadded to both the N- and the C-terminal regions of the rGFP or pGFP. Asthe N- and C-terminal region of rGFP and pGFP are putatively on the same“face” of the protein as is the case for aGFP, in spatial proximity(within 18 Å), it is possible to make a non-covalently “circular” rGFPor pGFP using the components of the invention. Thus, for example, theuse of dimerization sequences can allow formation of a noncovalentlycyclized protein. By attaching a first dimerization sequence to eitherthe N- or C-terminus of rGFP or pGFP, and adding a peptide of interestand a second dimerization sequence to the other terminus, a largecompact structure can be formed, with the protein or peptide displayedin a structure constrained by the dimerization sequences.

In a preferred embodiment, the protein or peptide of interest is fused to an internal position of the rGFP or pGFP; that is, the peptide is inserted at an internal position of the rGFP or pGFP. While the peptide can be inserted at virtually any position, preferred positions include insertion at the very tips of “loops” on the surface of the rGFP or pGFP, to minimize disruption of the rGFP and pGFP β-can protein structure. Thus, the rGFP or pGFP fusion polypeptide retains its ability to fluoresce, or maintains a T_(m) of at least 42° C. under assay conditions.

In a preferred embodiment, the proteins, peptides or other fusion partners are inserted in rGFP and/or pGFP loops. That is, as outlined below, peptides or libraries of peptides can be inserted into external loops (e.g., without replacing any residues) or can replace one or more of the native loop residues. In a preferred embodiment, the loop comprises residues from about 51 to about 62 for rmGFP or pGFP, and residues from about 48 to about 58 for rrGFP. Similar preferred embodiments utilize replacements or insertions at positions from about 79 to about 84 of both rmGFP and pGFP (about 76 to about 81 for rrGFP); replacements or insertions at positions from about 101 to about 107 (about 99 to about 104 for rrGFP); replacements or insertions at positions from about 117 to about 120 (about 114 to about 117 for rrGFP); replacements or insertions at positions from about 130 to about 148 (about 127 to about 145 for rrGFP); replacements or insertions at positions from about 154 to about 160 (about 151 to about 157 for rrGFP); replacements or insertions at positions from about 170 to about 177 (about 167 to about 174 for rrGFP); replacements or insertions at positions from about 186 to about 197 (about 183 to about 194 for rrGFP); and replacements or insertions at positions from about 206 to about 213 (about 202 to about 211 for rrGFP). More preferably, the insertion or replacement will take place between residues 117–120 for rmGFP or pGFP (114–117 for rrGFP); 170–177 (167–174 for rrGFP); or 206–213 (202–211 for rrGFP). Most preferably the insertion will take place between residues 170–177 or 208–213 of rmGFP or pGFP and corresponding residues of rrGFP.
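
For convenience, the approximate loop ranges recited above can be collected into a simple lookup, sketched below. The data structure and the function name `loop_for_position` are hypothetical illustrations; the residue ranges are those stated in the preceding paragraph (reading the seventh range as about 170 to about 177) and are approximate.

```python
# Approximate loop ranges (inclusive) from the preceding paragraph.
# Each entry pairs the rmGFP/pGFP range with the corresponding rrGFP range.
LOOPS = [
    ((51, 62), (48, 58)),
    ((79, 84), (76, 81)),
    ((101, 107), (99, 104)),
    ((117, 120), (114, 117)),
    ((130, 148), (127, 145)),
    ((154, 160), (151, 157)),
    ((170, 177), (167, 174)),
    ((186, 197), (183, 194)),
    ((206, 213), (202, 211)),
]


def loop_for_position(position, variant="rmGFP"):
    """Return the (start, end) loop containing `position`, or None.

    `variant` is "rmGFP" or "pGFP" (first range of each pair) or
    "rrGFP" (second range of each pair).
    """
    if variant in ("rmGFP", "pGFP"):
        index = 0
    elif variant == "rrGFP":
        index = 1
    else:
        raise ValueError("variant must be 'rmGFP', 'pGFP' or 'rrGFP'")
    for pair in LOOPS:
        start, end = pair[index]
        if start <= position <= end:
            return (start, end)
    return None


if __name__ == "__main__":
    print(loop_for_position(172))           # a position in rmGFP/pGFP numbering
    print(loop_for_position(172, "rrGFP"))  # the same number in rrGFP numbering
```

Such a lookup merely helps check whether a candidate insertion site falls within one of the recited loops; it makes no prediction about whether fluorescence is retained.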

In a preferred embodiment, the peptide of interest is inserted, without any deletion of rGFP or pGFP residues. That is, the insertion point is between two amino acids in the loop, adding the new amino acids of the peptide and fusion partners, including linkers. Generally, when linkers are used, the linkers are directly fused to the rGFP or pGFP, with additional fusion partners, if present, being fused to the linkers and the peptides.

In a preferred embodiment, the peptide is inserted into the rGFP orpGFP, with one or more rGFP or pGFP residues being deleted; that is, thepeptide (and fusion partners, including linkers) replaces one or moreresidues. In general, when linkers are used, the linkers are attacheddirectly to the rGFP or pGFP. Thus, it is linker residues which replacethe GFP residues, again generally at the tip of the loop. In general,when residues are replaced, from one to five residues of GFP aredeleted, with deletions of one, two, three, four and five amino acidsall possible. In another preferred embodiment, fusion polypeptides ofthe invention do not include linkers. When linkers are not used, thefusion polypeptides will be significantly more constrained because ofthe reduction in conformational freedom imposed by the GFP structure.

In a preferred embodiment, peptides (including fusion partners, ifapplicable) can be inserted into more than one loop of the scaffold, theamino terminal region, the carboxy terminal region, or combinationsthereof. Thus, for example, adding peptides to two loops can increasethe complexity of a random peptide library but still allow presentationof these loops on the same face of the protein. Similarly, it ispossible to add peptides to one or more loops, and add other fusionpartners to other loops, or amino terminal or carboxy terminal regions,for example targeting sequences, etc., to provide additional biologicalproperties to the fusion polypeptide or to localize the peptide tosubcellular or extracellular compartments where molecular interactionscan take place.

Accordingly, in a preferred embodiment, the fusion polypeptides may further comprise fusion partners. By “fusion partner” herein is meant a sequence that is associated with the peptide that confers upon all members of the library in that class a common function or ability. Fusion partners can be heterologous (i.e., not native to the host cell), or synthetic (i.e., not native to any cell). Suitable fusion partners include, but are not limited to: a) presentation structures, as defined below, which provide the peptides in a conformationally restricted or stable form; b) targeting sequences, defined below, which allow the localization of the peptide into a subcellular or extracellular compartment; c) rescue sequences as defined below, which allow the purification or isolation of either the peptides or the nucleic acids encoding them; d) stability sequences, which confer stability or protection from degradation on the peptide or the nucleic acid encoding it, for example resistance to proteolytic degradation; e) linker sequences, which conformationally decouple the random peptide elements from the scaffold itself and thereby keep the peptide from interfering with scaffold folding; f) any protein of interest; or g) any combination of the above, as well as linker sequences as needed. Since particular fusion partners are active in certain organisms or cells while not active in others, those skilled in the art can choose the appropriate fusion partner for particular cells or organisms.

In a preferred embodiment, the fusion partner is itself a presentationstructure. By “presentation structure” or grammatical equivalents hereinis meant a sequence, which, when fused to peptides, causes the peptidesto assume a conformationally restricted form. Proteins interact witheach other largely through conformationally constrained domains.Although small peptides with freely rotating amino and carboxyl terminican have potent functions as is known in the art, the conversion of suchpeptide structures into pharmacologic agents is difficult due to theinability to predict side-chain positions for peptidomimetic synthesis.Therefore the presentation of peptides in conformationally constrainedstructures will benefit both the later generation of pharmaceuticals andwill also likely lead to higher affinity interactions of the peptidewith a target protein. This fact has been recognized in thecombinatorial library generation systems using biologically generatedshort peptides in bacterial phage systems. A number of workers haveconstructed small domain molecules in which one might present peptidestructures (e.g., randomized peptide sequences).

Thus, synthetic presentation structures, i.e., artificial polypeptides, are capable of presenting a peptide as a conformationally-restricted domain. Generally such presentation structures comprise a first portion joined to the N-terminal end of the peptide of interest, and a second portion joined to the C-terminal end of the peptide; that is, the peptide is inserted into the presentation structure, although variations may be made, as outlined below, in which elements of the presentation structure are included within the peptide sequence. To limit the background cellular effects of protein sequences that are not part of the expressed protein or peptide of interest, the presentation structures are selected or designed to have minimal biological activity when expressed in the target cell.

Preferred presentation structures enhance interaction with binding partners by conformationally constraining the displayed peptide and maximizing accessibility to the peptide by presenting it on an exterior surface such as a loop. Accordingly, suitable presentation structures include, but are not limited to, dimerization sequences, minibody structures, loops on β-turns and coiled-coil stem structures in which residues not critical to structure are randomized, zinc-finger domains, cysteine-linked (disulfide) structures, transglutaminase linked structures, cyclic peptides, B-loop structures, helical barrels or 4-helix bundles, leucine zipper motifs, etc.

In a preferred embodiment, the presentation structure is a coiled-coil structure, allowing the presentation of a peptide, especially a random peptide, on an exterior loop (see Myszka et al. (1994) Biochem. 33: 2362–2373, hereby incorporated by reference). Using this system, investigators have isolated peptides capable of high affinity interaction with the appropriate target. In general, coiled-coil structures allow for between 6 and 20 randomized positions.

A preferred coiled-coil presentation structure is as follows: MGCAALESEVSALESEVASLESEVAALGRGDMPLAAVKSKLSAVKSKLASVKSKLAACGPP (SEQ ID NO: 24). The underlined regions represent a coiled-coil leucine zipper region defined previously (see Martin et al. (1994) EMBO J. 13: 5303–09, incorporated by reference). The bolded GRGDMP region represents the loop structure and when appropriately replaced with peptides (i.e., peptides, generally depicted herein as (X)_(n), where X is an amino acid residue and n is an integer of at least 5 or 6) can be of variable length. The replacement of the bolded region is facilitated by encoding restriction endonuclease sites in the underlined regions, which allows the direct incorporation of oligonucleotides encoding peptides of interest at these positions. For example, a preferred embodiment generates a XhoI site at the double-underlined LE site and a HindIII site at the double-underlined KL site.
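
At the amino acid level, the loop replacement described above can be illustrated with the following minimal sketch. It simply substitutes a peptide (X)_(n) for the GRGDMP loop of SEQ ID NO: 24; the function name `display_in_loop` and the validation rules are hypothetical, and the sketch does not model the restriction-site cloning step.

```python
# Coiled-coil presentation structure (SEQ ID NO: 24) and its loop region.
SCAFFOLD = "MGCAALESEVSALESEVASLESEVAALGRGDMPLAAVKSKLSAVKSKLASVKSKLAACGPP"
LOOP = "GRGDMP"
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def display_in_loop(peptide, scaffold=SCAFFOLD, loop=LOOP):
    """Replace the loop region of the scaffold with `peptide`.

    Raises ValueError if the peptide is empty, contains characters other
    than the 20 standard one-letter codes, or if the loop does not occur
    exactly once in the scaffold.
    """
    peptide = peptide.upper()
    if not peptide or set(peptide) - AMINO_ACIDS:
        raise ValueError("peptide must use standard one-letter codes")
    if scaffold.count(loop) != 1:
        raise ValueError("loop must occur exactly once in the scaffold")
    return scaffold.replace(loop, peptide, 1)


if __name__ == "__main__":
    # Hypothetical 10-residue peptide displayed in place of GRGDMP.
    print(display_in_loop("ACDEFGHIKL"))
```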

In a preferred embodiment, the presentation structure is a minibodystructure. A “minibody” is essentially composed of a minimal antibodycomplementarity region. The minibody presentation structure generallyprovides two peptide regions that are presented along a single face ofthe tertiary structure in the folded protein (see Bianchi et al. (1994)J. Mol. Biol. 236: 649–59, and references cited therein, all of whichare incorporated by reference). Investigators have shown this minimaldomain is stable in solution and have used phage selection systems incombinatorial libraries to select minibodies with displayed peptidesequences exhibiting high affinity, K_(d)=10⁻⁷, for the pro-inflammatorycytokine IL-6.

A preferred minibody presentation structure is as follows:

MGRNSQATSGFTFSHFYMEWVRGGEYIAASRHKHNKYTTEYSASVKGRYIVSRDTSQSILYLQKKK GPP (SEQ ID NO: 25). The bold, underlined regions are the regions which may be replaced with a peptide or randomized. The italicized phenylalanine must be invariant in the first peptide display region. The entire peptide is cloned in a three-oligonucleotide variation of the coiled-coil embodiment, thus allowing two different peptides of interest to be incorporated simultaneously. This embodiment utilizes non-palindromic BstXI sites on the termini.

In a preferred embodiment, the presentation structure is a sequence thatcontains generally two cysteine residues, such that a disulfide bond maybe formed, resulting in a conformationally constrained sequence. Thisembodiment is particularly preferred ex vivo, for example when secretorytargeting sequences are used. As will be appreciated by those in theart, any number of peptide sequences, with or without spacer or linkingsequences, may be flanked with cysteine residues. In other embodiments,effective presentation structures may be generated by the peptides ofinterest themselves. For example, the random peptides may be “doped”with cysteine residues which, under the appropriate redox conditions,may result in highly crosslinked structured conformations, similar to apresentation structure. Similarly, the randomization regions may becontrolled to contain a certain number of residues to confer β-sheet orα-helical structures.

In a preferred embodiment, the presentation sequence confers the ability to bind metal ions to confer secondary structure. Thus, for example, C₂H₂ zinc finger sequences are used; C₂H₂ sequences have two cysteines and two histidines placed such that a zinc ion is chelated. Zinc finger domains are known to occur independently in multiple zinc-finger peptides to form structurally independent, flexibly linked domains (see Nakaseko, Y. et al. (1992) J. Mol. Biol. 228: 619–36). A general consensus sequence is (5 amino acids)-C-(2 to 3 amino acids)-C-(4 to 12 amino acids)-H-(3 amino acids)-H-(5 amino acids) (SEQ ID NO: 26). A preferred example would be -FQCEEC-peptide of 3 to 20 amino acids-HIRSHTG- (SEQ ID NO: 27).
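
The general C₂H₂ consensus recited above can be expressed as a pattern and used to check candidate sequences, as in the following sketch. The regular expression is one possible transcription of the stated spacing and is offered as an assumption; the function name `matches_c2h2_consensus` and the example sequences are hypothetical.

```python
import re

AA = "[ACDEFGHIKLMNPQRSTVWY]"

# (5 aa)-C-(2 to 3 aa)-C-(4 to 12 aa)-H-(3 aa)-H-(5 aa)
C2H2_CONSENSUS = re.compile(
    AA + "{5}C" + AA + "{2,3}C" + AA + "{4,12}H" + AA + "{3}H" + AA + "{5}"
)


def matches_c2h2_consensus(sequence):
    """Return True if the whole sequence fits the general consensus spacing."""
    return C2H2_CONSENSUS.fullmatch(sequence.upper()) is not None


if __name__ == "__main__":
    # Hypothetical sequence built to fit the spacing: 5 + C + 2 + C + 6 + H + 3 + H + 5.
    candidate = "AAAAA" + "C" + "EE" + "C" + "GGGGGG" + "H" + "IRS" + "H" + "TGAAA"
    print(matches_c2h2_consensus(candidate))
```

The check concerns residue spacing only; it says nothing about whether a given sequence actually folds or chelates zinc.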

Similarly, CCHC boxes can be used that have a consensus sequence -C-(2 amino acids)-C-(4 to 20 amino acid peptide)-H-(4 amino acids)-C- (SEQ ID NO: 28) (see Bavoso, A. et al. (1998) Biochem. Biophys. Res. Commun. 242: 385–89, hereby incorporated by reference). Preferred examples include (1) -VKCFNC-4 to 20 amino acids-HTARNCR- (SEQ ID NO: 29), based on the nucleocapsid protein P2; (2) a sequence modified from that of the naturally occurring zinc-binding peptide of the Lasp-1 LIM domain (Hammarstrom, A. et al. (1996) Biochemistry 35: 12723–32); and (3) -MNPNCARCG-4 to 20 amino acid peptide-HKACF- (SEQ ID NO: 30), based on the NMR structural ensemble 1ZFP (Hammarstrom, A. et al., supra).

In a preferred embodiment, the presentation structure includes twodimerization sequences, including self-binding peptides. A dimerizationsequence allows the non-covalent association of two peptide sequences,which can be the same or different, with sufficient affinity to remainassociated under normal physiological conditions. These sequences may beused in several ways. In a preferred embodiment, one terminus of theprotein or peptide is joined to a first dimerization sequence and theother terminus is joined to a second dimerization sequence, which can bethe same or different from the first sequence. This allows the formationof a loop upon association of the dimerizing sequences. Alternatively,the use of these sequences effectively allows small libraries ofpeptides (for example, 10⁴) to become large libraries if two peptidesper cell are generated which then dimerize, to form an effective libraryof 10⁸ (10⁴×10⁴). It also allows the formation of longer protein orpeptide libraries, if needed, or more structurally complex peptidemolecules. The dimers may be homo- or heterodimers.
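
The library-size arithmetic in the preceding paragraph can be written out as a short sketch. It counts ordered pairings of two peptides per cell, which is the calculation stated above (10⁴×10⁴); the function name is hypothetical, and counting unordered pairs drawn from a single library would give a smaller number.

```python
def effective_library_size(first_library, second_library=None):
    """Number of ordered peptide pairings when two peptides dimerize per cell.

    If only one library size is given, both peptides are assumed to be
    drawn from the same library.
    """
    if second_library is None:
        second_library = first_library
    if first_library < 0 or second_library < 0:
        raise ValueError("library sizes must be non-negative")
    return first_library * second_library


if __name__ == "__main__":
    print(effective_library_size(10**4))
```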

Dimerization sequences may be a single sequence that self-aggregates, ortwo different sequences that associate. That is, nucleic acids are madeencoding both a first peptide with dimerization sequence 1, and a secondpeptide with dimerization sequence 2, such that upon introduction into acell and expression of the nucleic acid, dimerization sequence 1associates with dimerization sequence 2 to form a new peptide structure.The use of dimerization sequences allows the noncovalent “constraint” ofthe displayed peptides; that is, if a dimerization sequence is used ateach terminus of the peptide, the resulting structure can form aconstrained structure. Furthermore, the use of dimerizing sequencesfused to both the N- and C-terminus of the scaffold such as rGFP or pGFPforms a noncovalently constrained scaffold peptide library.

Suitable dimerization sequences will encompass a wide variety of sequences. Any number of protein-protein interaction sites are known. In addition, dimerization sequences may also be elucidated using standard methods such as the yeast two hybrid system, traditional biochemical affinity binding studies, or even using the present methods (see for example, WO 99/51625, hereby incorporated by reference in its entirety). Particularly preferred dimerization peptide sequences include, but are not limited to, -EFLIVKS- (SEQ ID NO: 31), EEFLIVKKS- (SEQ ID NO: 32), -FESIKLV- (SEQ ID NO: 33), and -VSIKFEL- (SEQ ID NO: 34). More preferred dimerization peptide sequences include EEEFLIVEEE (SEQ ID NO: 35) when used together with KKKFLIVKKK (SEQ ID NO: 36).

In a preferred embodiment, the fusion partner is a targeting sequence.As will be appreciated by those in the art, the localization of proteinswithin a cell is a simple method for increasing effective concentrationwithin a defined compartment. For example, RAF1 when localized to themitochondrial membrane can inhibit the anti-apoptotic effect of BCL-2.Similarly, membrane bound Sos induces Ras mediated signaling inT-lymphocytes. These mechanisms are thought to rely on the principle ofincreasing the protein concentration in a limited volume within a cell;that is to say, the localization of a protein to the plasma membranelimits the search for its ligand to that limited dimensional space nearthe membrane as opposed to the three dimensional space of the cytoplasm.Alternatively, the concentration of a protein can also be simplyincreased by nature of the localization. Shuttling the proteins into thenucleus confines them to a smaller space thereby increasingconcentration. Finally, the ligand or target may simply be present in aspecific compartment such that effectors (e.g., inhibitors) must belocalized appropriately.

Thus, suitable targeting sequences include, but are not limited to,binding sequences capable of causing binding of the expression productto a predetermined molecule or class of molecules while retainingbioactivity of the expression product (for example by using enzymeinhibitor or substrate sequences to target a class of relevant enzymes);sequences signaling selective degradation, of itself or co-boundproteins; and signal sequences capable of constitutively localizing thepeptides to a predetermined cellular locale, including a) subcellularlocations such as the Golgi, endoplasmic reticulum, nucleus, nucleoli,nuclear membrane, mitochondria, chloroplast, secretory vesicles,lysosome, periplasmic space, cellular membrane; and b) extracellularlocations via a secretory signal. Particularly preferred is localizationto either subcellular locations or to the outside of the cell viasecretion.

In a preferred embodiment, the targeting sequence is a nuclear localization signal (NLS). NLSs are generally short, positively charged (basic) domains that serve to direct the entire protein in which they occur to the cell's nucleus. Numerous NLS amino acid sequences have been reported, including single basic NLSs such as that of the SV40 (monkey virus) large T Antigen (PKKKRKV (SEQ ID NO: 37), Kalderon, D. et al. (1984) Cell 39: 499–509); the human retinoic acid receptor-β nuclear localization signal (ARRRRP (SEQ ID NO: 38)); NFκB p50 (EEVQRKRQKL (SEQ ID NO: 39), Ghosh, S. et al. (1990) Cell 62: 1019–29); NFκB p65 (EEKRKRTYE (SEQ ID NO: 40), Nolan, G. et al. (1991) Cell 64: 961–99); and others (see for example, Boulikas, T. (1994) J. Cell. Biochem. 55: 32–58, hereby incorporated by reference); and double basic NLSs exemplified by that of the Xenopus (African clawed toad) protein, nucleoplasmin (AVKRPAATKKAGQAKKKKLD (SEQ ID NO: 41); Dingwall, C. et al. (1982) Cell 30: 449–58 and Dingwall, S. et al. (1988) J. Cell Biol. 107: 641–49). Numerous localization studies have demonstrated that NLSs incorporated in synthetic peptides or grafted onto proteins not normally targeted to the cell nucleus cause these peptides and proteins to concentrate in the nucleus (see Dingwall, S. et al. (1986) Ann. Rev. Cell Biol. 2: 367–90; Bonnerot, C. et al. (1987) Proc. Natl. Acad. Sci. USA 84: 6795–99; and Galileo, D. S. et al. (1990) Proc. Natl. Acad. Sci. USA 87: 458–62).

Membrane-anchoring sequences are well known in the art and are based onthe genetic geometry of mammalian transmembrane molecules. Peptides areinserted into the membrane via a signal sequence (designated herein asssTM) and stably held in the membrane through a hydrophobictransmembrane domain (TM). The transmembrane proteins are positioned inthe membrane such that the protein region encompassing the aminoterminus relative to the transmembrane domain are extracellular and theregion towards the carboxy terminal are intracellular. Of course, if theposition of transmembrane domains is towards the amino end of theprotein relative to the peptide of interest, the TM will serve toposition the peptide intracellularly, which may be desirable in someembodiments. ssTMs and TMs are known for a wide variety of membranebound proteins, and these sequences are used accordingly, either aspairs from a particular protein or with each component being taken froma different protein. Alternatively, the ssTM and TM sequences aresynthetic and derived entirely from consensus sequences, thus serving asartificial delivery domains.

As will be appreciated by those in the art, membrane-anchoring sequences, including both ssTM and TM, are known for a wide variety of proteins and any of these are useful in the present invention. Particularly preferred membrane-anchoring sequences include, but are not limited to, those derived from CD8, ICAM-2, IL-8R, CD4 and LFA-1. Other useful ssTM and TM domains include sequences from: (a) class I integral membrane proteins, such as IL-2 receptor beta-chain (residues 1–26 are the signal sequence, 241–265 are the transmembrane residues; see Hatakeyama, M. et al. (1989) Science 244: 551–56 and von Heijne, G. et al. (1988) Eur. J. Biochem. 174: 671–78) and insulin receptor β chain (residues 1–27 are the signal domain, 957–959 are the transmembrane domain and 960–1382 are the cytoplasmic domain; see Hatakeyama, supra, and Ebina, Y. et al. (1985) Cell 40: 747–58); (b) class II integral membrane proteins, such as neutral endopeptidase (residues 29–51 are the transmembrane domain, 2–28 are the cytoplasmic domain; see Malfroy, B. et al. (1987) Biochem. Biophys. Res. Commun. 144: 59–66); (c) type III proteins such as human cytochrome P450 NF25 (Hatakeyama, supra); and (d) type IV proteins, such as human P-glycoprotein (Hatakeyama, supra). Particularly preferred are CD8 and ICAM-2. For example, the signal sequences from CD8 and ICAM-2 lie at the extreme 5′ end of the transcript. These consist of the amino acids 1–32 in the case of CD8 (MASPLTRFLSLNLLLLGESILGSGEAKPQAP (SEQ ID NO: 42), Nakauchi, H. et al. (1985) Proc. Natl. Acad. Sci. USA 82: 5126–30) and amino acids 1–21 in the case of ICAM-2 (MSSFGYRTLTVALFTLICCPG (SEQ ID NO: 43), Staunton, D. E. et al. (1989) Nature 339: 61–64). These leader sequences deliver the construct to the membrane while the hydrophobic transmembrane domains placed at the carboxy terminal region relative to the peptide of interest or peptide candidate agents serve to anchor the construct in the membrane. These transmembrane domains are encompassed by amino acids 145–195 from CD8 (PQRPEDCRPRGSVKGTGLDFACDIYIWAPLAGICVALLLSLIITLICYHSR (SEQ ID NO: 44), Nakauchi, supra) and 224–256 from ICAM-2 (MVIIVTVVSVLLSLFVTSVLLCFIFGQHLRQQR (SEQ ID NO: 45), Staunton, supra).

Alternatively, membrane anchoring sequences include the GPI anchor, which results in a covalent bond between the molecule and the lipid bilayer via a glycosyl-phosphatidylinositol bond. The GPI anchor sequence is exemplified by protein DAF, which comprises the sequence PNKGSGTTSGTTRLLSGHTCFTLTGLLGTLVTMGLLT (SEQ ID NO: 46), with the bolded serine the site of the anchor (see Homans, S. W. et al. (1988) Nature 333: 269–72 and Moran, P. et al. (1991) J. Biol. Chem. 266: 1250–57). Adding GPI anchor sites is accomplished by inserting the GPI sequence from Thy-1 in the carboxy terminal region relative to the inserted peptide of interest or randomized peptide. Thus, the GPI anchor sequence replaces the transmembrane domain in these constructs.

Similarly, acylation signals for attachment of lipid moieties can also serve as membrane anchoring sequences (see Stickney, J. T. (2001) Methods Enzymol. 332: 64–77). It is known that the myristylation of c-src localizes the kinase to the plasma membrane. This property provides a simple and effective method of membrane localization given that the first 14 amino acids of the protein are solely responsible for this function: MGSSKSKPKDPSQR (SEQ ID NO: 47) (see Cross, F. R. et al. (1984) Mol. Cell. Biol. 4: 1834–42; Spencer, D. M. et al. (1993) Science 262: 1019–24, both of which are hereby incorporated by reference) or MGQSLTTPLSL (SEQ ID NO: 48). The modification at the glycine residue (in bold) of the motif is effective in localizing reporter genes and can be used to anchor the zeta chain of the TCR. The myristylation signal motif is placed at the amino end relative to the variable region (or protein of interest) in order to localize the construct to the plasma membrane. Another lipid modification is isoprenoid attachment, which includes the 15 carbon farnesyl or the 20 carbon geranyl-geranyl group. The conserved sequence for isoprenoid attachment comprises the CaaX motif with the cysteine residue as the lipid modified amino acid. The X residue determines the type of isoprenoid modification. The preferred isoprenoid is geranyl-geranyl when X is a leucine or phenylalanine (Farnsworth, C. C. et al. (1994) Proc. Natl. Acad. Sci. USA 91: 11963–67). Farnesyl is the preferred lipid for a broader range of X amino acids such as methionine, serine, glutamine and alanine. The “aa” in the isoprenoid attachment motif are generally aliphatic residues, although other residues are also functional. Farnesylation sequences include the carboxy terminal SKDGKKKKKKSKTKCVIM (SEQ ID NO: 49) of K-Ras4B. Other isoprenoid attachment motifs are found in the C termini of N and H-Ras GTPases (Aronheim, A. et al. (1994) Cell 78: 949–61). Attachment of farnesyl groups to various forms of GFP provides a useful marker for monitoring cell membrane morphology and cell sorting by FACS. Moreover, cells retain the farnesylated forms upon treating the cells with fixative while cytoplasmic forms of GFP may leach out of the cell.
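By way of illustration only, the dependence of the isoprenoid type on the X residue of the CaaX motif can be sketched in a few lines of Python. The function name and the returned labels are hypothetical, and only the X residues named in the preceding paragraph are classified; any other residue is reported as unclassified rather than guessed.

```python
# Illustrative sketch: classify a C-terminal CaaX box by its X residue.
# Only the residues named in the text are covered (an assumption of this sketch).
GERANYLGERANYL_X = {"L", "F"}          # leucine, phenylalanine
FARNESYL_X = {"M", "S", "Q", "A"}      # methionine, serine, glutamine, alanine


def classify_caax(protein: str) -> str:
    """Return 'farnesyl', 'geranylgeranyl', 'unclassified', or 'none'."""
    seq = protein.upper()
    # A CaaX box requires a cysteine four residues from the C terminus.
    if len(seq) < 4 or seq[-4] != "C":
        return "none"
    x_residue = seq[-1]
    if x_residue in GERANYLGERANYL_X:
        return "geranylgeranyl"
    if x_residue in FARNESYL_X:
        return "farnesyl"
    return "unclassified"


KRAS4B_TAIL = "SKDGKKKKKKSKTKCVIM"  # SEQ ID NO: 49, ends in the CaaX box CVIM
print(classify_caax(KRAS4B_TAIL))   # farnesyl
```

The c-src myristylation signal (SEQ ID NO: 47) has no cysteine at the fourth position from its C terminus, so the same function reports no CaaX box for it.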

Localization to the cell membrane by lipid modification is also achieved by palmitoylation. Attachment of the palmitoyl group can be directed to either the amino or carboxy terminal region relative to the protein of interest. In addition, multiple palmitoyl residues or combinations of palmitoyl and isoprenoid groups are possible. Amino terminal additions of a palmitoyl group may use the sequence MVCCMRRTKQV (SEQ ID NO: 50) from the Gap43 protein, while carboxy terminal modifications are possible with CMSCKCVLKKKKKK (SEQ ID NO: 51) from a Ras mutant (modified amino acids in bold). Other palmitoylation sequences are found in the G protein-coupled receptor kinase GRK6 sequence (LLQRLFSRQDCCGNCSDSEEELPTRL (SEQ ID NO: 52), Stoffel, R. H. et al. (1994) J. Biol. Chem. 269: 27791–94); rhodopsin (KQFRNCMLTSLCCGKNPLGD (SEQ ID NO: 53), Barnstable, C. J. et al. (1994) J. Mol. Neurosci. 5: 207–09); and the p21H-ras 1 protein (LNPPDESGPGCMSCKCVLS (SEQ ID NO: 54), Capon, D. J. et al. (1983) Nature 302: 33–37). Use of the carboxy terminal sequence LNPPDESGPGC(p)MSC(p)KC(f)VLS (SEQ ID NO: 55) of H-Ras (modified amino acids in bold; p is a palmitoyl group and f is a farnesyl group) allows attachment of both palmitoyl and farnesyl lipids.

In a preferred embodiment, the targeting sequence is a lysosomal targeting sequence, including, for example, a lysosomal degradation sequence such as Lamp-2 (KFERQ (SEQ ID NO: 56), Dice, J. F. (1992) Ann. N.Y. Acad. Sci. 674: 58–64); or lysosomal membrane sequences from Lamp-1 (MLIPIAGFFALAGLVLIVLIAYLIGRKRSHAGYQTI (SEQ ID NO: 57), Uthayakumar, S. et al. (1995) Cell. Mol. Biol. Res. 41: 405–20) or Lamp-2 (LVPIAVGAALAGVLILVLLAYFIGLKHHHAGYEQF (SEQ ID NO: 58), Konecki, D. S. et al. (1994) Biochem. Biophys. Res. Comm. 205: 1–5; where italicized residues comprise the transmembrane domains and underlined residues comprise the cytoplasmic targeting signal).

Alternatively, the targeting sequence may be a mitochondrial localization sequence, including mitochondrial matrix sequences (e.g., yeast alcohol dehydrogenase III; MLRTSSLFTRRVQPSLFSRNILRLQST (SEQ ID NO: 59), Schatz, G. (1987) Eur. J. Biochem. 165: 1–6); mitochondrial inner membrane sequences (yeast cytochrome c oxidase subunit IV; MLSLRQSIRFFKPATRTLCSSRYLL (SEQ ID NO: 60), Schatz, supra); mitochondrial intermembrane space sequences (yeast cytochrome c1; MFSMLSKRWAQRTLSKSFYSTATGMSKSGKLTQKLVTAGVAAAGITASTLLYADSLTAEAMTA (SEQ ID NO: 61), Schatz, supra); or mitochondrial outer membrane sequences (yeast 70 kD outer membrane protein; MKSFITRNKTAILATVMTGTAIGAYYYYNQLQQQQQRGKK (SEQ ID NO: 62), Schatz, supra).

The targeting sequences may also be endoplasmic reticulum sequences, including the sequences from calreticulin (KDEL (SEQ ID NO: 63), Pelham, H. R. (1992) Royal Society London Transactions B; 1–10) or adenovirus E3/19K protein (LYLSRRSFIDEKKMP (SEQ ID NO: 64), Jackson, M. R. et al. (1990) EMBO J. 9: 3153–62). Furthermore, targeting sequences also include peroxisome sequences (for example, the peroxisome matrix sequence of luciferase, SKL; Keller, G. A. et al. (1987) Proc. Natl. Acad. Sci. USA 84: 3264–68); or destruction sequences (cyclin B1, RTALGDIGN (SEQ ID NO: 65); Klotzbucher, A. et al. (1996) EMBO J. 15: 3053–64).

In a preferred embodiment, the targeting sequence is a secretory signal sequence capable of effecting the secretion of the peptide of interest or peptide candidate agent. There are a large number of known secretory signal sequences which direct secretion of the peptide into the extracellular space when placed at the amino end relative to the peptide of interest. Secretory signal sequences and their transferability to unrelated proteins are well known (see Silhavy, T. J. et al. (1985) Microbiol. Rev. 49: 398–418). Secretion of the peptide is particularly useful for generating peptides capable of binding to the surface of, or affecting the physiology of, target cells other than the host cell, e.g., the cell infected with the retrovirus. In a preferred approach, a fusion product is configured to contain, in series, secretion signal peptide-presentation structure-randomized peptide region or protein of interest-presentation structure. In this manner, target cells grown in the vicinity of cells expressing the library of peptides are exposed to the secreted peptide. Target cells exhibiting a physiological change in response to the presence of the secreted peptide (e.g., by the peptide binding to a surface receptor or by being internalized and binding to intracellular targets) and the peptide secreting cells are localized by any of a variety of selection schemes, and the structure of the peptide effector is identified. Exemplary effects include that of a designer cytokine (e.g., a stem cell factor capable of causing hematopoietic stem cells to divide and maintain their totipotency), a factor causing cancer cells to undergo spontaneous apoptosis, a factor that binds to the cell surface of target cells and labels them specifically, etc.

Suitable secretory sequences are known, including signals from IL-2 (MYRMQLLSCIALSLALVTNS (SEQ ID NO: 66), Villinger, F. et al. (1995) J. Immunol. 155: 3946–54); growth hormone (MATGSRTSLLLAFGLLCLPWLQEGSAFPT (SEQ ID NO: 67), Roskam, W. G. et al. (1979) Nucleic Acids Res. 7: 305–20); preproinsulin (MALWMRLLPLLALLALWGPDPAAAFVN (SEQ ID NO: 68), Bell, G. I. et al. (1980) Nature 284: 26–32); and influenza HA protein (MKAKLLVLLYAFVAGDQI (SEQ ID NO: 69), Sekikawa, K. et al. (1983) Proc. Natl. Acad. Sci. USA 80: 3563–67), with cleavage occurring at the nonunderlined-underlined junction. A particularly preferred secretory signal sequence is the signal leader sequence from the secreted cytokine IL-4, MGLTSQLLPPLFFLLACAGNFVHG (SEQ ID NO: 70), which comprises the first 24 amino acids of IL-4.
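As a purely illustrative sketch, the series arrangement of a secreted fusion product (secretion signal peptide, presentation structure, peptide region, presentation structure) can be written as a simple string concatenation in Python. The function name and the presentation structure arguments are hypothetical placeholders; only the IL-4 leader sequence is taken from the text.

```python
# Illustrative sketch: assemble a secreted fusion in the series order
# signal peptide - presentation structure - peptide - presentation structure.
IL4_LEADER = "MGLTSQLLPPLFFLLACAGNFVHG"  # SEQ ID NO: 70


def build_secreted_fusion(peptide: str,
                          presentation_n: str = "",
                          presentation_c: str = "",
                          leader: str = IL4_LEADER) -> str:
    """Concatenate the components with the leader at the amino end."""
    # presentation_n / presentation_c are placeholders supplied by the caller;
    # no particular presentation structure sequence is implied.
    return leader + presentation_n + peptide + presentation_c


# The leader is 24 residues long, matching "the first 24 amino acids of IL-4".
print(len(IL4_LEADER))
print(build_secreted_fusion("ACDE", presentation_n="GG", presentation_c="GG"))
```

Placing the leader first reflects the requirement that the secretory signal sit at the amino end relative to the peptide of interest.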

In a preferred embodiment, the fusion partner is a rescue sequence. A rescue sequence is a sequence which may be used to purify or isolate either the peptide of interest or the candidate agent or the nucleic acid encoding it. Thus, for example, peptide rescue sequences include purification sequences such as the His₆ tag for use with Ni²⁺ affinity columns and epitope tags useful for detection, immunoprecipitation, or FACS (fluorescence-activated cell sorting). Suitable epitope tags include myc (for use with the commercially available 9E10 antibody), the BSP biotinylation target sequence of the bacterial enzyme BirA, flu tags, lacZ, GST, and Strep tag I and II.

Alternatively, the rescue sequence may be a unique oligonucleotide sequence which serves as a probe target site to allow the quick and easy isolation of the retroviral construct, via PCR, related techniques, or hybridization.

In a preferred embodiment, the fusion partner is a stability sequence that affects the stability of the peptide of interest or candidate bioactive agent. In one aspect, the stability sequence confers stability to the peptide of interest or candidate bioactive agent. For example, peptides may be stabilized by the incorporation of glycines after the initiating methionine (MG or MGG), for protection of the peptide from ubiquitination as per Varshavsky's N-End Rule, thus conferring increased half-life in the cell (see Varshavsky, A. (1996) Proc. Natl. Acad. Sci. USA 93: 12142–49). Similarly, adding two prolines at the C-terminus makes peptides that are largely resistant to carboxypeptidase action. The presence of two glycines prior to the prolines both imparts flexibility and prevents structure-perturbing events in the di-proline from propagating into the peptide structure. Thus, preferred stability sequences are MG(X)_(n)GGPP (SEQ ID NO: 71), where X is any amino acid and n is an integer of at least four.
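The MG(X)_(n)GGPP format lends itself to a short worked example. The following Python sketch, offered only as an illustration, wraps a peptide in the stabilizing termini and checks whether a sequence fits the format; the function names are hypothetical, and restricting X to the twenty standard amino acids is an assumption of the sketch.

```python
import re

# Assumption of this sketch: X is limited to the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# MG, then at least four residues, then GGPP, with nothing before or after.
STABILITY_PATTERN = re.compile(rf"^MG[{AMINO_ACIDS}]{{4,}}GGPP$")


def add_stability_caps(peptide: str) -> str:
    """Return the peptide flanked as MG-(peptide)-GGPP."""
    peptide = peptide.upper()
    if len(peptide) < 4:
        raise ValueError("n must be at least four residues")
    if any(residue not in AMINO_ACIDS for residue in peptide):
        raise ValueError("peptide contains a non-standard residue code")
    return "MG" + peptide + "GGPP"


def has_stability_caps(sequence: str) -> bool:
    """True if the whole sequence fits MG(X)n GGPP with n >= 4."""
    return STABILITY_PATTERN.fullmatch(sequence.upper()) is not None


print(add_stability_caps("ACDEF"))        # MGACDEFGGPP
print(has_stability_caps("MGACDEFGGPP"))  # True
print(has_stability_caps("MGACDGGPP"))    # False (only three X residues)
```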

In another aspect, the stability sequence decreases the stability of the peptide of interest or candidate bioactive agent. Sequences such as PEST sequences (i.e., polypeptide sequences enriched in proline (P), glutamic acid (E), serine (S) and threonine (T); see Rechsteiner, M. (1996) Trends Biochem. Sci. 21: 267–71) and destruction boxes (Glotzer, M. (1991) Nature 349: 132–38) destabilize proteins by targeting them for degradation. For example, fusion of PEST sequences to a GFP reporter protein decreases the half-life of GFP, thus providing an indicator of dynamic cellular processes, including, but not limited to, regulated protein degradation, transcriptional activity, and cell cycle status (Mateus, C. et al. (2000) Yeast 16: 1313–23; Li, X. (1998) J. Biol. Chem. 273: 34970–75). Numerous PEST sequences useful for targeting peptides for degradation are known. These include amino acids 422–461 of ornithine decarboxylase (Corish, P. (1999) Protein Eng. 12: 1035–40; Li, X. et al., U.S. Pat. No. 6,130,313) and the C terminal sequences of IκBα (Lin, R. (1996) Mol. Cell Biol. 16: 1401–09). Destruction boxes found in cell cycle proteins, for example cyclin B1, can also reduce the half-life of fusion proteins, but in a cell cycle dependent manner (Corish, P., supra).

The fusion partners may be placed anywhere (i.e., N-terminal, C-terminal, internal loops) in the structure as the biology and activity permit. In addition, while the discussion has been directed to the fusion of fusion partners to the peptide or protein of interest of the fusion polypeptide, it is also possible to fuse one or more of these fusion partners to the rGFP or pGFP portion of the fusion polypeptide. Thus, for example, the rGFP or pGFP may contain a targeting sequence (either N-terminal region, C-terminal region, or internal region, as described above) at one location, and a rescue sequence in the same place or a different place on the molecule. Thus, any combination of fusion partners, peptides of interest, and rGFP or pGFP proteins may be made.

In a preferred embodiment, the fusion partner includes a linker or spacer sequence. Linker sequences between various targeting sequences, for example, membrane targeting sequences, and the other components of the constructs, such as the randomized peptides, may be desirable to allow the peptides to interact with potential targets unhindered. For example, useful linkers include glycine polymers (G)_(n), glycine-serine polymers (including, for example, (GS)_(n), (GSGGS)_(n) (SEQ ID NO: 72) and (GGGS)_(n) (SEQ ID NO: 73), where n is an integer of at least one), glycine-alanine polymers, alanine-serine polymers, and other flexible linkers, such as the tether for the shaker potassium channel, and a large variety of other flexible linkers, as will be appreciated by those in the art. Glycine and glycine-serine polymers are preferred since both of these amino acids are relatively unstructured, and therefore may be able to serve as a neutral tether between components. Glycine polymers are the most preferred as glycine accesses significantly more phi-psi space than even alanine, and is much less restricted than residues with longer side chains (see Scheraga, H. A. (1992) Rev. Computational Chem. III 73–142). Second, serine is hydrophilic and therefore able to solubilize what could be a globular glycine chain. Third, similar chains have been shown to be effective in joining subunits of recombinant proteins such as single chain antibodies.

In a preferred embodiment, the peptide is connected to the rGFP or pGFP via linkers. That is, while one embodiment utilizes the direct linkage of the peptide of interest to the rGFP or pGFP, or of the peptide and any fusion partners to the rGFP or pGFP protein, a preferred embodiment utilizes linkers at one or both ends of the peptide. That is, when attached either to the N- or C-terminus, one linker may be used. When the peptide of interest is inserted in an internal position, as is generally outlined above, preferred embodiments utilize at least one linker and preferably two, one at each terminus of the peptide. Linkers are generally preferred for conformationally decoupling any insertion sequence (i.e., the peptide) from the scaffold structure itself, to minimize local distortions in the scaffold structure that can either destabilize folding intermediates, or allow access to the GFP's buried tripeptide fluorophore, which decreases (or eliminates) rGFP or pGFP fluorescence due to exposure to exogenous collisional fluorescence quenchers (see Phillips, G. N. (1997) Curr. Opin. Struct. Biol. 7: 821–27, hereby incorporated by reference in its entirety).

Accordingly, as outlined below, when the peptides are inserted into internal positions in the rGFP or pGFP protein, preferred embodiments utilize linkers, and preferably (Gly)_(n) linkers, where n is 1 or more, with n being two, three, four, five and six preferred, although linkers of 7–10 or more amino acids are also possible. Generally in this embodiment, no amino acids with β-carbons are used in the linkers.
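The repeat-unit linkers and their placement at both termini of an internally inserted peptide can be illustrated with a minimal Python sketch. All function names, the scaffold string, and the insertion position are hypothetical and serve only to show the arrangement; the glycine-only check relies on the fact that glycine is the one standard amino acid lacking a β-carbon.

```python
# Illustrative sketch: build repeat-unit linkers and flank an internal insert.

def make_linker(unit: str, n: int) -> str:
    """Repeat a linker unit such as 'G', 'GS', 'GSGGS' or 'GGGS' n times."""
    if n < 1:
        raise ValueError("n must be an integer of at least one")
    return unit * n


def lacks_beta_carbons(linker: str) -> bool:
    """True when every residue is glycine, which has no beta-carbon."""
    return bool(linker) and set(linker) <= {"G"}


def insert_with_linkers(scaffold: str, position: int,
                        peptide: str, linker: str) -> str:
    """Insert linker-peptide-linker into the scaffold at the given index."""
    return (scaffold[:position] + linker + peptide + linker
            + scaffold[position:])


gly3 = make_linker("G", 3)
print(make_linker("GSGGS", 2))              # GSGGSGSGGS
print(lacks_beta_carbons(gly3))             # True
print(lacks_beta_carbons(make_linker("GS", 2)))  # False
# 'AAAABBBB' is a stand-in scaffold, not an rGFP or pGFP sequence.
print(insert_with_linkers("AAAABBBB", 4, "XYZ", "GG"))  # AAAAGGXYZGGBBBB
```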

In addition, the fusion partners, including presentation structures, may be modified, randomized, and/or mutated to alter the presentation orientation of the randomized expression product. For example, determinants at the base of the loop may be modified to slightly modify the internal loop peptide tertiary structure, to properly display the protein or peptide of interest.

In a preferred embodiment, combinations of fusion partners are used. Thus, for example, any number of combinations of peptides of interest, presentation structures, targeting sequences, rescue sequences, and stability sequences may be used, with or without linker sequences. As will be appreciated by those in the art, using a base vector that contains a cloning site for inserting various peptides, a person skilled in the art can cassette in various fusion partners. In addition, as discussed herein, it is possible to have more than one peptide of interest in a construct, either together to form a new surface or to bring two other molecules together. Similarly, as described below, it is possible to have peptides inserted at two or more different loops of the rGFP or pGFP protein, preferably, though not necessarily, on the same “face” of the GFP protein.

In view of the foregoing, the present invention further relates to fusion nucleic acids for encoding and expressing the proteins described above. By “fusion nucleic acid” herein is meant a plurality of nucleic acid components that are joined together, either directly or indirectly. As will be appreciated by those in the art, in some embodiments the sequences described herein may be DNA, for example when extrachromosomal plasmids are used, or RNA when retroviral vectors are used. In some embodiments, the sequences are directly linked together without any linking sequences, while in other embodiments linkers such as restriction endonuclease cloning sites or linkers encoding flexible amino acids, such as the glycine or serine linkers known in the art, are used, as discussed above. In addition, the fusion nucleic acids may further comprise substitutions to codon optimize the nucleic acid for expression of the encoded proteins in a particular target organism.

To facilitate the generation of fusion polypeptides comprising rGFP or pGFP, the present invention further provides for rGFP or pGFP fusion nucleic acids with a multiple cloning site (MCS) inserted into the rGFP or pGFP nucleic acid sequences at about the amino terminal region, the carboxy terminal region, or at least one loop as outlined above, or combinations thereof. The presence of an MCS facilitates generation of fusion constructs, including cDNA, genomic DNA, and random peptide fusion libraries. When the MCS is at the amino terminal region, the MCS may contain its own translation initiation sequence to regulate translation of inserted nucleic acids lacking their own translation initiation sites (e.g., random peptide sequences). Alternatively, when the MCS is present downstream of the initiating amino acid (i.e., methionine) near the amino terminal region, or at the carboxy terminal or internal loops of rGFP or pGFP, the translation initiation sequences of rGFP or pGFP are generally used.

In the present invention, the fusion nucleic acids further comprise expression vectors for expressing the proteins of the present invention. The expression vectors may be either self-replicating extrachromosomal vectors or vectors which integrate into a host genome. Generally, these expression vectors include control sequences operably linked to the nucleic acid encoding the protein. The term “control sequences” refers to DNA sequences necessary for the expression of an operably linked coding sequence in a particular host organism. Thus, control sequences include sequences required for transcription and translation of the nucleic acids, which are selected in reference to the target organism used for expressing the proteins. For example, for prokaryotes, the sequences include a promoter, optionally an operator sequence, and a ribosome binding site. Eukaryotic cells are known to utilize promoters, polyadenylation signals, and enhancers.

Nucleic acid is “operably linked” when it is placed into a functional relationship with another nucleic acid sequence. In the present context, operably linked means that the control sequences, such as transcription and translation regulatory sequences, are positioned relative to the coding sequence in such a manner that expression of the encoded protein occurs. For example, a promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the sequence; or a ribosome binding site is operably linked to a coding sequence if it is positioned so as to facilitate translation. Where the fusion nucleic acid encodes a fusion protein, for example a protein linked to a secretory leader sequence, the DNA for the secretory leader is operably linked to DNA for a polypeptide if it is expressed in a manner resulting in secretion of the polypeptide.

In general, the transcriptional and translational regulatory sequences may include, but are not limited to, promoter sequences, enhancer or transcriptional activator sequences, ribosomal binding sites, CAP sequences, transcriptional start and stop sequences, and translational start and stop sequences. In a preferred embodiment, the regulatory sequences include a promoter and transcriptional start and stop sequences.

Promoter sequences are either constitutive or inducible promoters. By “promoter” herein is meant nucleic acid sequences capable of initiating transcription of the fusion nucleic acid or portions thereof. Promoters may be constitutive, wherein the transcription level is constant and unaffected by modulators of promoter activity. Promoters may be inducible, in that promoter activity is capable of being increased or decreased, for example as measured by the presence or quantitation of transcripts or translation products (see Walter, W. et al. (1996) J. Mol. Med. 74: 379–92). Promoters may also be cell specific, wherein the promoter is active only in particular cell types. Thus, promoter as defined herein includes sequences required for initiating and regulating the transcription level and transcription in specific cell types. Furthermore, the promoters may be either naturally occurring promoters, hybrid promoters which combine elements of more than one promoter, or synthetic promoters based on consensus sequences of known promoters.

The fusion nucleic acid comprising the expression vector may comprise additional elements. For example, the expression vector may have two replication systems, thus allowing it to be maintained in two organisms, for example in mammalian or insect cells for expression and in a prokaryotic host for cloning and amplification. Furthermore, for integrating into the host chromosomal elements, the expression vector may contain sequences necessary for the integration process. The integration sequences used will depend on the integration mechanism. For homologous recombination, a sequence homologous to specific regions of a host cell genome is incorporated into the fusion nucleic acid, as is well known in the art. Preferably two homologous sequences flank the expression construct or the region to be inserted into the genome. By selecting the appropriate homologous sequence, the vector may be directed to specific regions of the host cell genome. Alternatively, integration is directed by inclusion of sequences necessary for site specific recombination. A variety of site specific recombination systems are known. The cre-lox system comprises the Cre recombinase of bacteriophage P1, which catalyzes recombination between short 34 basepair lox-P sites. Presence of lox-P sites on two different DNAs results in recombination between the two lox-P sites, thus generating a single recombinant containing two lox-P sites flanking the integrated DNA (see for example, Fukushige, S. et al. (1992) Proc. Natl. Acad. Sci. USA 89: 7905–09). Cre-lox recombinations can function in any cell system containing lox-P sites and Cre recombinase. Insertion of lox-P sites into the genome of organisms and expression of Cre allows for recombination events in bacterial, yeast, plant, and mammalian cells (Sauer, B. (1996) Nucleic Acids Res. 24: 4608–13; Araki, K. et al. (1997) Nucleic Acids Res. 25: 868–72; and Vergunst, A. C. (1998) Plant Mol. Biol. 38: 393–406; U.S. Pat. No. 4,959,317).

Other systems applicable for integrating the expression vectors include, but are not limited to, the flp recombinase system (see for example, U.S. Pat. No. 6,140,129), the λ integrase system, bacteriophage Mu, transposon systems (e.g., γδ), retroviral vectors, and the like. As some of the integration mechanisms function only in certain organisms, the appropriate integration system is selected according to the cells in which the expression vectors are used, as is well known in the art.

In another preferred embodiment, the site-specific recombination sites are not used for integration but for deletion or rearrangement of nucleic acid sequences on the fusion nucleic acid. Suitable site specific recombination sequences include cre-lox and flp. Rearrangements may occur for fusion nucleic acids present extrachromosomally or for fusion nucleic acids integrated into the host chromosome. Generally, the site-specific recombination sequences flank the nucleic acid sequences selected for deletion or rearrangement. That is, a first site-specific sequence is present 5′ and a second site specific sequence is present 3′ of the sequence to be deleted or rearranged. Thus, the sites may flank promoter or promoter controlling elements, genes of interest, splicing sequences, translational controlling elements, or combinations thereof. Whether the site specific recombination sequences lead to deletion or rearrangement generally depends on the orientation of the recombination sites. Placement of flp or loxP sites in head-to-head orientation (i.e., inverted repeat) results in inversion of the interlying DNA, while placement in head-to-tail orientation (i.e., direct repeat) results in excision of the interlying DNA. These features may be useful in several situations, for example, when it is desirable to activate expression of the rGFP or pGFP fusion polypeptide in specific cells, tissues, or at specific periods, especially at specific times in cellular development. To achieve this effect, a rGFP or pGFP fusion nucleic acid, flanked by loxP or flp sites placed in inverted repeat orientation, is linked in a reverse orientation relative to a promoter such that transcription results in generation of an antisense strand rather than the sense strand of the fusion nucleic acid encoding the fusion polypeptide, thus resulting in absence of rGFP or pGFP protein. To properly express rGFP or pGFP protein in these cells, the recombinase is expressed in these cells, either by transfection or by inducing expression of an endogenous copy of the recombinase, which results in inversion of the rGFP or pGFP relative to the promoter. This rearrangement places the gene in proper orientation for synthesis of the sense strand that leads to expression of the protein.
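The orientation rule for recombination sites can be illustrated with a small Python sketch that models only the outcome for the interlying DNA: inversion for inverted repeats and excision for direct repeats. The function names and the toy sequences are hypothetical, and the recombination sites themselves are deliberately left out of the strings for simplicity.

```python
# Illustrative sketch: outcome of site-specific recombination on the DNA
# lying between two recombination sites (sites omitted for simplicity).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def recombine(upstream: str, interlying: str, downstream: str,
              orientation: str) -> str:
    """Apply the orientation rule to the interlying DNA."""
    if orientation == "inverted":   # head-to-head sites: inversion
        return upstream + reverse_complement(interlying) + downstream
    if orientation == "direct":     # head-to-tail sites: excision
        return upstream + downstream
    raise ValueError("orientation must be 'inverted' or 'direct'")


promoter = "TATA"                   # toy stand-in for a promoter
sense_orf = "ATGGCC"                # toy stand-in for a coding sequence
silent = reverse_complement(sense_orf)  # cassette placed in reverse orientation

# Before recombination the coding sequence reads in the antisense direction;
# inversion restores the sense orientation relative to the promoter.
print(promoter + silent)                             # TATAGGCCAT
print(recombine(promoter, silent, "", "inverted"))   # TATAATGGCC
print(recombine(promoter, silent, "", "direct"))     # TATA
```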

In a preferred embodiment, the expression vector also contains a selectable marker gene to allow the selection of transformed host cells. Generally, the selection will confer a detectable phenotype that provides a way of differentiating between cells that express and do not express the selection gene. Selection genes are well known in the art and will vary with the host cell used, as further described below.

In accordance with the foregoing, a variety of expression vectors are used to express the nucleic acids encoding the proteins of the present invention. As used herein, the term “vector” includes plasmids, cosmids, artificial chromosomes, viruses, and the like. In one preferred embodiment, the expression vectors are bacterial expression vectors including vectors for Bacillus subtilis, E. coli, Haemophilus, Streptococcus cremoris, and Streptomyces lividans, among others. These vectors are well known in the art. A suitable bacterial promoter is any nucleic acid sequence capable of binding bacterial RNA polymerase and initiating the downstream (3′) transcription of the coding sequence of the fusion protein into mRNA. A bacterial promoter has a transcription initiation region which is usually placed proximal to the 5′ end of the coding sequence. This transcription initiation region typically includes an RNA polymerase binding site and a transcription initiation site. Sequences encoding metabolic pathway enzymes provide particularly useful promoter sequences. Examples include promoter sequences derived from enzymes that metabolize sugars such as galactose, lactose and maltose, and sequences derived from biosynthetic enzymes such as those for tryptophan. Promoters from bacteriophage (e.g., pL) may also be used and are known in the art. In addition, synthetic promoters and hybrid promoters are also useful; for example, the tac promoter is a hybrid of the trp and lac promoter sequences. Furthermore, a bacterial promoter can include naturally occurring promoters of non-bacterial origin that have the ability to bind bacterial RNA polymerase and initiate transcription.

In addition to a functioning promoter sequence, an efficient ribosome binding site is desirable. In E. coli, the ribosome binding site is called the Shine-Dalgarno (SD) sequence and includes an initiation codon and a sequence 3–9 nucleotides in length located 3–11 nucleotides upstream of the initiation codon.
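For illustration, the spacing rule for the ribosome binding site can be expressed as a short Python search. This is a sketch under stated assumptions: the function name is hypothetical, the default motif AGGAGG is the commonly cited E. coli consensus rather than a sequence given in this specification, and the 3–11 nucleotide distance is interpreted here as the number of nucleotides between the end of the motif and the first base of the ATG.

```python
import re

# Illustrative sketch: find ATG codons preceded by an SD-like motif whose
# 3' end lies 3-11 nucleotides upstream of the ATG.

def find_sd_start_pairs(seq: str, motif: str = "AGGAGG",
                        min_gap: int = 3, max_gap: int = 11):
    """Return (motif_start, gap, atg_index) tuples for qualifying ATGs."""
    seq = seq.upper()
    hits = []
    # The lookahead lets overlapping ATG occurrences be reported as well.
    for match in re.finditer(r"(?=ATG)", seq):
        atg_index = match.start()
        for gap in range(min_gap, max_gap + 1):
            motif_end = atg_index - gap
            motif_start = motif_end - len(motif)
            if motif_start >= 0 and seq[motif_start:motif_end] == motif:
                hits.append((motif_start, gap, atg_index))
    return hits


# Toy sequence: motif at index 2, six nucleotides of spacer, ATG at index 14.
print(find_sd_start_pairs("TTAGGAGGTTTTTTATGAAA"))  # [(2, 6, 14)]
```

An ATG that lacks the motif in the permitted window produces no hit, so the function returns an empty list for such a sequence.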

The expression vector may also include a signal peptide sequence that provides for secretion of the fusion protein in bacteria. The signal sequence typically encodes a signal peptide comprised of hydrophobic amino acids which direct the secretion of the protein from the cell, as is well known in the art. The protein is either secreted into the growth media (gram-positive bacteria) or into the periplasmic space, located between the inner and outer membrane of the cell (gram-negative bacteria).

The bacterial expression vector may also include a selectable marker gene to allow for the selection of bacterial strains that have been transformed. Suitable selection genes include genes which render the bacteria resistant to drugs such as ampicillin, chloramphenicol, erythromycin, kanamycin, neomycin and tetracycline. Selectable markers also include biosynthetic genes, such as those in the histidine, tryptophan and leucine biosynthetic pathways. These components are assembled into expression vectors and introduced into bacterial host cells, using techniques well known in the art (e.g., calcium chloride treatment, electroporation, etc.).

In another preferred embodiment, the expression vectors are used to express the proteins in yeast cells. Yeast expression systems are well known in the art, and include expression vectors for Saccharomyces cerevisiae, Candida albicans and C. maltosa, Hansenula polymorpha, Kluyveromyces fragilis and K. lactis, Pichia guilliermondii and P. pastoris, Schizosaccharomyces pombe, and Yarrowia lipolytica. Preferred promoter sequences for expression in yeast include the inducible GAL promoters (e.g., GAL 1, GAL 4, GAL 10, etc.), the promoters from alcohol dehydrogenase (ADH or ADC1), enolase, glucokinase, glucose-6-phosphate isomerase, glyceraldehyde-3-phosphate-dehydrogenase, hexokinase, phosphofructokinase, 3-phosphoglycerate mutase, pyruvate kinase, fructose bisphosphate, acid phosphatase gene, tryptophan synthase (TRP5) and the copper inducible CUP1 promoter. Any plasmid containing a yeast compatible promoter, an origin of replication, and termination sequences is suitable.

Yeast selectable markers include genes complementing mutations in ADE2, HIS4, LEU2, TRP1, and URA3, and genes conferring resistance to tunicamycin (ALG7 gene), resistance to G418 (neomycin phosphotransferase gene), growth in the presence of copper ions (metallothionein CUP1 gene), resistance to fluoroacetate (fluoroacetate dehalogenase), or resistance to formaldehyde (formaldehyde dehydrogenase).

In another preferred embodiment, the expression vectors are used for expression in plants. Plant expression vectors are well known in the art. Vectors are known for expressing genes in Arabidopsis thaliana, tobacco, carrot, maize, and rice cells. Suitable promoters for use in plants include those of plant or viral origin, including, but not limited to, the CaMV 35S promoter (active in both monocots and dicots, Chapman, S. et al. (1992) Plant J. 2: 549–557), nopaline promoter, mannopine synthase promoter, soybean or Arabidopsis thaliana heat shock promoters, tobacco mosaic virus promoter (Takamatsu, et al. (1987) EMBO J. 6: 307), and AT2S promoters of Arabidopsis thaliana (i.e., PAT2S1, PATS2, PATS3 etc.). In another preferred embodiment, the promoters are tissue specific promoters active in specific plant tissues or cell types (e.g., roots, leaves, shoot meristem etc.), which are well known in the art. Alternatively, the expression vectors comprise recombinant plasmid expression vectors based on Ti plasmids or root inducing plasmids.

In another aspect, regulatory sequences include “enhancers” to regulate expression. Preferably these are of plant, bacterial (e.g., Agrobacterium), or viral origin and are specific to plants. The enhancers may act at either the transcriptional or translational level. The fusion nucleic acids may also comprise one or more introns, preferably of plant origin, to increase the efficiency of expression of the fusion nucleic acid. For example, insertion of an intron into the 5′ untranslated sequence of a gene (e.g., between the site of transcription initiation and translation initiation) leads to increased stability of the messenger RNA. The intron is preferably, though not necessarily, the first intron.

Optionally, a selectable marker gene is used with the expression vectors. The marker may be a drug resistance gene, a herbicide resistance gene, or any other selectable marker that can be used for selecting cells containing the vector. Suitable plant markers include adenosine deaminase, dihydrofolate reductase, hygromycin transferase, the bar gene (Lohar, D. P. (2001) J. Exp. Bot. 52: 1697–702), green fluorescent proteins (including the rGFP and pGFPs of the present invention), and aminoglycoside 3′-O-phosphotransferase II (i.e., kanamycin, neomycin, and G418 resistance).

In addition, the plant expression vectors may comprise plant specific targeting sequences in addition to the targeting sequences described above. In one aspect, the sequences are chloroplast or mitochondrial targeting sequences. An example of a chloroplast targeting signal is that of the small subunit of ribulose 1,5-diphosphate carboxylase of Pisum sativum. For a mitochondrial targeting sequence, an example is that of the precursor of the beta subunit of mitochondrial ATPase F1 of Nicotiana plumbaginifolia. In another aspect, the targeting signal comprises a vacuolar targeting sequence or “propeptide”. These sequences target the proteins to vacuoles of aqueous tissues, including leaves, or protein bodies of storage tissues (Neuhaus, J. M. et al. (1991) Proc. Natl. Acad. Sci. USA 88: 10362–66; and Sebastiani, F. L. et al. (1991) Eur. J. Biochem. 199: 441–50).

In another preferred embodiment, the expression vectors are used to express the proteins and nucleic acids of the present invention in insects and insect cells. In one preferred embodiment, fusion proteins are produced in insect cells. Expression vectors for the transformation of insect cells, and in particular, baculovirus vectors used to create recombinant baculoviruses for expressing foreign genes, are well known in the art (see for example, O'Reilly, D. R. et al. “Baculovirus Expression Vectors: A Laboratory Manual,” W. H. Freeman & Co, New York, 1992). By “baculovirus” or “nuclear polyhedrosis viruses” as used herein is meant expression systems using viruses classified under the family Baculoviridae, preferably subgroup A. In preferred embodiments, these include systems specific for Bombyx, Autographa, and Spodoptera (see for example, U.S. Pat. No. 5,194,376). Other expression systems include Amsacta moorei entomopoxvirus (AmEPV), Aedes aegypti densonucleosis virus (Aedes DNV, U.S. Pat. No. 5,849,523), and Galleria mellonella densovirus (GmDNV, Tal, et al. (1993) Arch. Insect Biochem. Physiol. 22: 345–356). In another preferred embodiment, expression vectors comprise fusion nucleic acids that integrate into the host chromosome. This may be achieved by homologous recombination, particularly modified homologous recombination techniques when the insect cells or insect do not readily undergo homologous recombination (see Rong, Y. S. (2000) Science 288: 2013–18); site directed recombination (e.g., cre-lox); and transposon mediated integration (e.g., P-element transposition elements).

Promoters suitable for controlling expression in insects include the Autographa californica nuclear polyhedrosis virus polyhedrin promoter, heat shock promoter (e.g., hsp 70), tubulin promoter, p10 promoter, and Aedes DNV viral p7 and p61 promoters. In one preferred embodiment, the promoter allows expression at an early stage in viral infection and/or allows expression in substantially all tissues of an insect. In another preferred embodiment, the promoter is a cell specific and developmental stage specific promoter, many of which are well known in the art. As used herein, developmental stage specific promoters are promoters that are active at only certain stages in insect development, for example, embryonic, larval, pupal, and adult stages. Examples of developmental stage specific promoters are the ecdysone regulated promoters, which are active during molting and larval/pupal stages because of increases in the steroid hormone ecdysone during these developmental periods. Cell specific promoters include promoters active in the nervous system (e.g., ELAV), imaginal discs, gut, Malpighian tubules, antennae (e.g., odor binding protein gene promoter), etc.

Although mammalian targeting sequences function in insect cells, targeting sequences derived from insect genes are preferred under some circumstances, for example to efficiently express secreted or membrane bound proteins in insect cells. Signal sequences include the Manduca sexta AKH signal peptide sequence, Drosophila cuticle protein signal peptides (e.g., CP1, CP2, CP3 and CP4, U.S. Pat. No. 5,278,050), and the honey bee melittin excretion peptide (MKFLVDVALVFMWYISYIYA) (SEQ ID NO: 74).

In a preferred embodiment, the expression vectors are used for expression in animals, especially mammals. A variety of expression vectors are known for expressing proteins in animal cells, including fusion nucleic acids existing extrachromosomally, as integrants in the host chromosome, or as viral nucleic acids. Viral vectors may be based on adenoviral, lentiviral, alphaviral, poxviral (vaccinia virus), or retroviral vectors. In a preferred embodiment, the viral expression vector system is a retroviral vector, such as is generally described in PCT/US97/01019 and PCT/US97/01048, both of which are hereby expressly incorporated by reference.

By “retroviral vectors” herein is meant vectors used to introduce into appropriate hosts the nucleic acids of the present invention in the form of an RNA viral particle. A variety of retroviral vectors are known in the art. Preferred retroviral vectors include a vector based on the murine stem cell virus (MSCV) (Hawley, R. G. et al. (1994) Gene Ther. 1: 136–38), a modified MFG virus (Riviere, I. et al. (1995) Proc. Natl. Acad. Sci. USA 92: 6733–37), and pBABE (see PCT US97/01019). In addition, particularly well suited retroviral transfection systems for generating retroviral vectors are described in Mann et al., supra; Pear, W. S. et al. (1993) Proc. Natl. Acad. Sci. USA 90: 8392–96; Kitamura, T. et al. (1995) Proc. Natl. Acad. Sci. USA 92: 9146–50; Kinsella, T. M. et al. (1996) Hum. Gene Ther. 7: 1405–13; Hofmann, A. et al. (1996) Proc. Natl. Acad. Sci. USA 93: 5185–90; Choate, K. A. et al. (1996) Hum. Gene Ther. 7: 2247–53; WO 94/19478; PCT US97/01019, and references cited therein, all of which are incorporated by reference. Other suitable retroviral vectors include, among others, the LRCX retroviral vector set; pSIR retroviral vector; pLEGFP-N1 retroviral vector; pLAPSN retroviral vector; pLXIN retroviral vector; and pLXSN retroviral vector; all of which are commercially available (e.g., Clontech, Palo Alto, Calif.). Generally, the retroviral vectors described above are used to express the nucleic acids of the present invention in proliferating cells. When target cells are non-proliferating (e.g., brain cells), useful viral vectors are derived from lentiviruses (Miyoshi, H. et al. (1998) J. Virol. 72: 8150–57), adenoviruses (Zheng, C. et al. (2000) Nat. Biotechnol. 18: 176–80) or alphaviruses (Ehrengruber, M. U. (1999) Proc. Natl. Acad. Sci. USA 96: 7041–46). In addition, the retroviral vectors may incorporate the self-inactivating (SIN) feature of the 3′ LTR enhancer/promoter to inactivate viral promoters upon integration, which allows use of other promoters for regulating expression of the fusion nucleic acid. It is possible to configure these SIN retroviral vectors to permit inducible expression of retroviral inserts after integration of a single vector into a target cell (Hofmann, et al. (1996) Proc. Natl. Acad. Sci. USA 93: 5185).

The mammalian vectors may include inducible and constitutive promoters for expressing the genes of interest encoding the polypeptides of the present invention. A mammalian promoter will have a transcription initiating region, generally located 5′ to the start of the coding region, and a TATA box, present at about 25–30 base pairs upstream of the transcription initiation site. The promoter will also contain upstream regulatory elements that control the rate and initiation of transcription, including CAAT and GC boxes, enhancer sequences, and repressor/silencer sequences (see for example, Chang, B. D. (1996) Gene 183: 137–42). These promoter controlling elements may act directionally, requiring placement upstream of the promoter region, or act non-directionally. These aforementioned transcriptional control sequences may be provided from non-viral or viral sources. Commonly used promoters and enhancers are from viral sources since the viral genes have a broad host range and produce high expression rates. Viral promoters, including upstream controlling sequences, may be from polyoma virus, adenovirus 2, simian virus 40 (early and late promoters), herpes simplex virus (e.g., HSV thymidine kinase promoter), human cytomegalovirus (CMV) promoter, and mouse mammary tumor virus (MMTV-LTR) promoter. A variety of non-viral promoters with constitutive, inducible, cell specific, or developmental stage specific activities are also well known in the art (e.g., β-globin promoter, mammalian heat shock promoter, metallothionein, ubiquitin C promoters, EF-1 alpha promoters, etc.). Cell specific promoters, which are well known in the art, include promoters active in specific cells including, but not limited to, brain, olfactory bulb, thyroid, lung, muscle, pancreas, liver, heart, breast, prostate, kidney, etc. Promoters and promoter controlling elements are chosen based on the desired level of promoter activity and the cell type in which the proteins of the present invention are to be expressed.
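For illustration only, the positional relationship stated above (a TATA box at about 25–30 base pairs upstream of the transcription initiation site) can be checked on a candidate promoter sequence with a short script. The sketch below is not part of the described methods; the motif pattern, the function name find_tata_boxes, and the default window are assumptions chosen for the example, and real promoters may deviate from this simplified consensus.

```python
import re

# Assumed, simplified TATA-like consensus (TATA, then A/T, A, A/T).
# Real promoters vary; this pattern is illustrative only.
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(sequence, tss_index, min_upstream=25, max_upstream=30):
    """Return distances (bp upstream of the initiation site) of TATA-like motifs.

    sequence  -- promoter-region DNA, 5' to 3', as a string
    tss_index -- 0-based index of the transcription initiation site
    Only motifs starting min_upstream..max_upstream bp upstream are reported.
    """
    upstream = sequence.upper()[:tss_index]
    hits = []
    for match in TATA_PATTERN.finditer(upstream):
        distance = tss_index - match.start()
        if min_upstream <= distance <= max_upstream:
            hits.append(distance)
    return hits


if __name__ == "__main__":
    # Hypothetical sequence: motif placed so that it starts 30 bp upstream.
    promoter = "G" * 20 + "TATAAAA" + "C" * 23 + "ATGGCC"
    print(find_tata_boxes(promoter, tss_index=50))
```

A motif found outside the stated window is simply not reported, so the same function can be run with a wider window (for example, min_upstream=1 and max_upstream=200) when surveying a sequence of unknown organization.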

Generally, the mammalian vectors also include selectable marker genes. Suitable marker genes include reporter or selection genes as further described below. Selection genes include, but are not limited to, neomycin, blasticidin, bleomycin, puromycin, hygromycin, and multiple drug resistance (MDR) genes. Suitable reporter genes include fluorescent or luminescent proteins (e.g., green fluorescent proteins, luciferases), enzymatic markers (e.g., β-galactosidase, glucuronidase, alkaline phosphatase, etc.), and surface proteins (e.g., CD8).

Additional sequences in the expression vectors include splice sites for proper expression, polyadenylation signals, 5′ CAP sequence, transcription termination sequences, and the like. Typically, transcription termination and polyadenylation sequences recognized by mammalian cells are regulatory regions located 3′ to the translation stop codon and thus, together with the promoter elements, flank the coding sequence. The 3′ terminus of the mature mRNA is formed by site-specific post-transcriptional cleavage and polyadenylation. Examples of transcription terminator and polyadenylation signals include those derived from SV40.

Other sequences may include centromere sequences for generating human artificial chromosomes (HACs) for delivering larger fragments of DNA than can be contained and expressed in a plasmid or viral vector. HACs of 6 to 10 Mbp are constructed and delivered via conventional delivery methods (liposomes, polycationic amino polymers, or vesicles) for therapeutic purposes. The choice and design of an appropriate vector is within the ability and discretion of one of ordinary skill in the art.

In a further preferred embodiment, the fusion nucleic acids of the present invention may comprise a first gene of interest, a separation sequence, and a second gene of interest. In a preferred embodiment, at least one of the genes of interest is an rGFP or pGFP or their variants, or an rGFP or pGFP fusion polypeptide described above. By “gene of interest” herein is meant any nucleic acid sequence capable of encoding a “protein of interest” or a “protein,” as defined below. However, in some embodiments, the “gene of interest” encompasses a nucleic acid sequence element that does not encode a protein. These elements may include, but are not limited to, promoter/enhancer elements, chromatin organizing sequences, ribosome binding sequences, mRNA splicing sequences, multiple cloning sites, etc.

In a preferred embodiment, the gene of interest is a reporter gene. By “reporter gene” or “selection gene” or grammatical equivalents herein is meant a gene that by its presence in a cell (i.e., upon expression) allows the cell to be distinguished from a cell that does not contain the reporter gene. Reporter genes can be classified into several different types, including detection genes, survival genes, death genes, cell cycle genes, cellular biosensors, proteins producing a dominant cellular phenotype, and conditional gene products. In the present invention, expression of the protein product causes the effect distinguishing between cells expressing the reporter gene and those that do not. As is more fully outlined below, additional components, such as substrates, ligands, etc., may be added to allow selection or sorting on the basis of the reporter gene.

In a preferred embodiment, the first and second genes of interest encode the same rGFP or pGFP. These constructs allow increased expression of the GFP molecule or GFP fusion polypeptide since two copies of the same gene are expressed in a single transcriptional event. The presence of a separation sequence allows the synthesis of separate fluorescent proteins, thus obviating any detrimental effect that might arise from fusing two reporter proteins to each other. Synthesizing high levels of encoded protein is desirable when needed to produce a cellular phenotype, for example when expressing a random peptide fused to rGFP or pGFP. Similarly, for example when screening for promoter regulators, signal amplification may be accomplished by expressing two identical rGFP or pGFP reporter genes.
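By way of a non-limiting sketch, the arrangement of a first gene of interest, a separation sequence, and a second gene of interest may be represented as a small data structure for bookkeeping when designing constructs. The class name FusionConstruct, the placeholder label "SEP" for the separation sequence, and the simplification that only the literal names "rGFP" and "pGFP" count as GFP reporters (variants and fusion polypeptides are not modeled) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

# Reporter names taken from the text; variants and fusions are not modeled here.
GFP_NAMES = {"rGFP", "pGFP"}


@dataclass(frozen=True)
class FusionConstruct:
    """Hypothetical record of a two-gene fusion nucleic acid."""
    first_gene: str
    separation_sequence: str
    second_gene: str

    def __post_init__(self):
        # At least one gene of interest is expected to be rGFP or pGFP.
        if not ({self.first_gene, self.second_gene} & GFP_NAMES):
            raise ValueError("at least one gene of interest should be rGFP or pGFP")

    def layout(self) -> List[str]:
        """Element order, 5' to 3', on the single transcript."""
        return [self.first_gene, self.separation_sequence, self.second_gene]

    def products(self) -> List[str]:
        """Separate proteins produced; the separation sequence is not a product."""
        return [self.first_gene, self.second_gene]

    def reporter_copies(self) -> int:
        """Number of GFP reporter copies expressed per transcript."""
        return sum(gene in GFP_NAMES for gene in (self.first_gene, self.second_gene))


if __name__ == "__main__":
    doubled = FusionConstruct("rGFP", "SEP", "rGFP")
    print(doubled.layout(), doubled.reporter_copies())
```

In this representation, a construct with two identical reporter genes reports two copies per transcript, which corresponds to the signal amplification case, whereas a construct pairing a GFP with a different gene reports one.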

In another preferred embodiment, the gene of interest comprises a reporter gene distinguishable from rGFP or pGFP. Expressing two distinguishable, separate reporter proteins allows targeting of individual reporter proteins to distinct cellular locations, provides increased discrimination of cells expressing the fusion nucleic acid, and affords a basis for monitoring expression of the other reporter gene.

In a preferred embodiment, the distinguishable reporter gene comprises a protein that can be used as a direct label, for example a detection gene for sorting the cells or for cell enrichment by FACS. In this embodiment, the protein product of the reporter gene itself can serve to distinguish cells that are expressing the reporter gene. In one aspect, suitable reporter genes include distinguishable wildtype and variant forms of Renilla reniformis GFP, Ptilosarcus gurneyi GFP, and Renilla muelleri GFP. In another aspect, the reporter gene comprises other fluorescent proteins, such as Aequorea victoria GFP (Chalfie, M. et al. (1994) Science 263: 802–05), enhanced GFP (EGFP; Clontech; GenBank Accession Number U55762), blue fluorescent protein (BFP; Quantum Biotechnologies, Inc. 1801 de Maisonneuve Blvd. West, 8th Floor, Montreal (Quebec) Canada H3H 1J9; Stauber, R. H. (1998) Biotechniques 24: 462–71; Heim, R. et al. (1996) Curr. Biol. 6: 178–82), enhanced yellow fluorescent protein (EYFP; Clontech Laboratories, Inc., 1020 East Meadow Circle, Palo Alto, Calif. 94303), Anemonia majano fluorescent protein (amFP486, Matz, M. V. (1999) Nat. Biotech. 17: 969–73), Zoanthus fluorescent proteins (zFP506, zFP538; Matz, supra), Discosoma fluorescent protein (dsFP483, drFP583; Matz, supra), and Clavularia fluorescent protein (cFP484; Matz, supra). Other suitable reporter genes include, among others, luciferases (for example, firefly, Kennedy, H. J. et al. (1999) J. Biol. Chem. 274: 13281–91; Renilla reniformis, Lorenz, W. W. (1996) J. Biolumin. Chemilumin. 11: 31–37; Renilla muelleri, U.S. Pat. No. 6,232,107), β-galactosidase (Nolan, G. et al. (1988) Proc. Natl. Acad. Sci. USA 85: 2603–07), β-glucuronidase (Jefferson, R. A. et al. (1987) EMBO J. 6: 3901–07; Gallagher, S., “GUS Protocols: Using the GUS Gene as a reporter of gene expression,” Academic Press, Inc., 1992), horseradish peroxidase, alkaline phosphatase, and SEAP (i.e., the secreted form of human placental alkaline phosphatase; Cullen, B. R. et al. (1992) Methods Enzymol. 216: 362–68).

In another embodiment, the reporter gene encodes a protein that will bind a label that can be used as the basis of the cell enrichment (sorting); that is, the reporter gene serves as an indirect label or detection gene. In a preferred embodiment, the reporter gene encodes a cell-surface protein. For example, the reporter gene may encode any cell-surface protein not normally expressed on the surface of the cell, such that secondary binding agents serve to distinguish cells that contain the reporter gene from those that do not. Alternatively, albeit non-preferably, reporters comprising normally expressed cell-surface proteins could be used, and differences between cells containing the reporter construct and those without could be determined. Thus, secondary binding agents bind to the reporter protein. These secondary binding agents are preferably labeled, for example with fluors, and can be antibodies, haptens, etc. For example, fluorescently labeled antibodies to the reporter gene product can be used as the label. Similarly, membrane-tethered streptavidin could serve as a reporter, and fluorescently labeled biotin could be used as the label, i.e., the secondary binding agent. Alternatively, the secondary binding agents need not be labeled as long as the secondary binding agent can be used to distinguish the cells containing the construct; for example, the secondary binding agents may be used in a column, and the cells passed through, such that the expression of the reporter gene results in the cell being bound to the column, and a lack of the reporter gene (i.e., inhibition) results in the cells not being retained on the column. Other suitable reporter proteins/secondary labels include, but are not limited to, antigens and antibodies, enzymes and substrates (or inhibitors), etc.

In a preferred embodiment, the reporter gene comprises a survival gene that serves to provide a nucleic acid without which the cell cannot survive, such as drug resistance genes. In this embodiment, expressing the survival gene allows selection of cells expressing the fusion nucleic acid by identifying cells that survive, for example in the presence of a selection compound. Examples of drug resistance genes include, but are not limited to, puromycin resistance (puromycin-N-acetyl-transferase) (de la Luna, S. et al. (1992) Methods Enzymol. 216: 376–85), G418 neomycin resistance gene, hygromycin resistance gene (hph), and blasticidin resistance genes (bsr, brs, and BSD; Perez-Gonzalez, et al. (1990) Gene, 86: 129–34; Izumi, M. et al. (1991) Exp. Cell Res. 197: 229–33; Itaya, M. et al. (1990) J. Biochem. 107: 799–801; Kimura, M. et al. (1994) Mol. Gen. Genet. 242: 121–29). In addition, generally applicable survival genes are the family of ATP-binding cassette transporters, including multiple drug resistance gene (MDR1) (see Kane, S. E. et al. (1988) Mol. Cell. Biol. 8: 3316–21 and Choi, K. H. et al. (1988) Cell 53: 519–29), multi-drug resistance associated proteins (MRP) (Bera, T. K. et al. (2001) Mol. Med. 7: 509–16), and breast cancer resistance protein (BCRP or MXR) (Tan, B. et al. (2000) Curr. Opin. Oncol. 12: 450–58). When expressed in cells, these selectable transporter genes can confer resistance to a variety of toxic reagents, especially anti-cancer drugs (e.g., methotrexate, colchicine, tamoxifen, mitoxantrone, and doxorubicin). As will be appreciated by those skilled in the art, the choice of the selection/survival gene will depend on the host cell type used.

In a preferred embodiment, the reporter gene comprises a death gene that causes the cells to die when expressed. Death genes fall into two basic categories: death genes that encode death proteins requiring a death ligand to kill the cells, and death genes that encode death proteins that kill cells as a result of high expression within the cell and do not require the addition of any death ligand. Preferred are cell death mechanisms that require a two-step process: the expression of the death gene and induction of the death phenotype with a signal or ligand, such that the cells may be grown expressing the death gene and then induced to die. A number of death gene/ligand pairs are known, including, but not limited to, the Fas receptor and Fas ligand (Schneider, P. et al. (1997) J. Biol. Chem. 272: 18827–33; Gonzalez-Cuadrado, S. et al. (1997) Kidney Int. 51: 1739–46; Muruve, D. A. et al. (1997) Hum. Gene Ther. 8: 955–63); p450 and cyclophosphamide (Chen, L. et al. (1997) Cancer Res. 57: 4830–37); thymidine kinase and ganciclovir (Stone, R. (1992) Science 256: 1513); and tumor necrosis factor (TNF) receptor and TNF.

When death genes requiring ligands are used, preferred embodiments use chimeric death genes (i.e., chimeric death receptor genes). Chimeric death receptors may comprise the extracellular domain of a ligand-activated multimerizing receptor and the endogenous cytoplasmic domain of a death receptor gene, such as Fas or TNF. This avoids endogenous activation of the death gene. Thus, in one embodiment, substituting the extracellular portion of a death receptor, such as Fas, with the extracellular portion of another ligand-activated multimerizing receptor provides a basis for using a completely different signal to activate cell death. Suitable ligand-activated dimerizing receptors include, but are not limited to, the CD8 receptor, erythropoietin receptor, thrombopoietin receptor, growth hormone receptor, Fas receptor, platelet-derived growth factor receptor, epidermal growth factor receptor, leptin receptor, and various interleukin receptors (e.g., IL-1, IL-2, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-11, IL-12, IL-13, IL-15, and IL-17). When particular receptors are employed to modulate promoter activity, these receptors (e.g., IL-4 when examining IL-4 induced promoter activity) are not preferred for use as a chimeric death gene receptor.

In a preferred embodiment, the chimeric cell death receptor genes are chimeric Fas receptors. The exact combination will depend on the cell type used and the receptors normally produced by these cells. For illustration, when the cells are human cells, a non-human extracellular domain and a human cytosolic domain are preferred to prevent endogenous induction of the death gene. Thus, when human cells are used, a preferred chimeric receptor gene may comprise a murine extracellular Fas receptor domain and a human cytosolic domain, such that the endogenous human Fas ligand will not activate the murine receptor domain. Alternatively, human extracellular domains may be used when the cells do not endogenously produce the cognate ligand. For example, the human EPO receptor extracellular domain may be used when cells do not endogenously produce EPO (Kawaguchi, Y. et al. (1997) Cancer Lett. 116: 53–59; Takebayashi, H. et al. (1996) Cancer Res. 56: 4164; Rudert, F. et al. (1994) Biochem. Biophys. Res. Commun. 204: 1102–10; Takahashi, T. et al. (1996) J. Biol. Chem. 271: 17555–60). In another aspect, the extracellular domains are combinations of different extracellular domains that form functional receptors (Mares, et al. (1992) Growth Factors, 6: 93–101; Seedorf, K. et al. (1991) J. Biol. Chem. 266: 12424–31; Heidaran, M. A. et al. (1990) J. Biol. Chem. 265: 18741–44; Okuda, K. et al. (1997) J. Clin. Invest. 100: 1708–15; Anders, R. A. et al. (1996) J. Biol. Chem. 271: 21758–66; Krishnan, K. et al. (1996) Oncogene, 13: 125–33; Ohashi, et al. (1994) Proc. Natl. Acad. Sci. USA, 91: 158–62; and Amara, J. F. et al. (1997) Proc. Natl. Acad. Sci. USA 94: 10618–23). In general, the chimeric death gene receptors have a transmembrane domain. As will be appreciated by those skilled in the art, the transmembrane domain from any of the receptors can be used, although it is preferable to use the transmembrane domain associated with the chosen cytosolic domain to preserve the interaction of the transmembrane domain with other endogenous signaling proteins (Declercq, W. et al. (1995) Cytokine 7: 701–09).

Alternatively, the death genes are “one step” death genes, in which death results from high expression of the gene. These death genes kill a cell without requiring a ligand or secondary signal. In one aspect, cell death is induced by the overexpression of a number of programmed cell death (PCD) proteins known to cause cell death, including, but not limited to, caspases, bax, TRADD, FADD, SCK, MEK, etc.

In another aspect, one step death genes also include toxins that cause cell death, or impair cell survival or cell function, when expressed by a cell. These toxins generally do not require addition of a ligand to produce toxicity. An example of a suitable toxin is Campylobacter toxin CDT (Lara-Tejero, M. (2000) Science, 290: 354–57). Expression of the CdtB subunit, which has homology to nucleases, causes cell cycle arrest and ultimately cell death. Another toxin, the diphtheria toxin (and the similar Pseudomonas exotoxin), functions by ADP-ribosylating the EF-2 (elongation factor 2) molecule in the cell and preventing translation. Expression of the diphtheria toxin A subunit induces cell death in cells expressing the toxin fragment. Other useful toxins include cholera toxin and pertussis toxin (the catalytic A subunit ADP-ribosylates G proteins that regulate adenylate cyclase), pierisin from cabbage butterflies (induces apoptosis in mammalian cells; Watanabe, M. (1999) Proc. Natl. Acad. Sci. USA 96: 10608–13), phospholipase snake venom toxins (Diaz, C. et al. (2001) Arch. Biochem. Biophys. 391: 56–64), ribosome inactivating toxins (e.g., ricin A chain, Gluck, A. et al. (1992) J. Mol. Biol. 226: 411–24; and nigrin, Munoz, R. et al. (2001) Cancer Lett. 167: 163–69), and pore forming toxins (hemolysin and leukocidin). When the target cells are neuronal cells, neuronal specific toxins may be used to inhibit specific neuronal functions. These include bacterial toxins such as botulinum toxin and tetanus toxin, which are proteases that act on synaptic vesicle associated proteins (e.g., synaptobrevin) to prevent neurotransmitter release (see Binz, T. et al. (1994) J. Biol. Chem. 269: 9153–58; Lacy, D. B. et al. (1998) Curr. Opin. Struct. Biol. 8: 778–84).

Another preferred embodiment of a gene of interest is a cell cycle gene, that is, a gene that causes alterations in the cell cycle. For example, Cdk interacting protein p21 (see Harper, J. W. et al. (1993) Cell 75: 805–16), which inhibits cyclin dependent kinases, does not cause cell death but causes cell-cycle arrest. Thus, expressing p21 allows selecting for regulators of promoter activity or regulators of p21 activity based on detecting cells that grow out much more quickly due to low p21 activity, either through inhibition of promoter activity or inactivation of p21 protein activity. As will be appreciated by those in the art, it is also possible to configure the system to select cells based on their inability to grow out due to increased p21 activity. Similar mitotic inhibitors include p27, p57, p16, p15, p18 and p19, and p19 ARF (or its human homolog p14 ARF). Other cell cycle proteins useful for altering the cell cycle include cyclins (Cln), cyclin dependent kinases (Cdk), cell cycle checkpoint proteins (e.g., Rad17, p53), Cks1 p9, Cdc phosphatases (e.g., Cdc 25), etc.

In yet another preferred embodiment, the gene of interest encodes a cellular biosensor. In these fusion nucleic acids, at least one of the genes of interest may encode an rGFP or pGFP fusion polypeptide, which is itself a cellular biosensor, or the cellular biosensor may be expressed in addition to the rGFP or pGFP (or rGFP or pGFP fusion protein). By a “cellular biosensor” herein is meant a gene product that when expressed within a cell can provide information about a particular cellular state. Biosensor proteins allow rapid determination of changing cellular conditions, for example Ca²⁺ levels in the cell, pH within cellular organelles, and membrane potentials (see Miesenbock, G. et al. (1998) Nature 394: 192–95; U.S. Pat. No. 6,150,176). An example of an intracellular biosensor is Aequorin, which emits light upon binding to Ca²⁺ ions. The intensity of light emitted depends on the Ca²⁺ concentration, thus allowing measurement of transient calcium concentrations within the cell. When directed to particular cellular organelles by fusion partners, as more fully described below, the light emitted by Aequorin provides information about Ca²⁺ concentrations within the particular organelle. Other intracellular biosensors are chimeric GFP molecules engineered for fluorescence resonance energy transfer (FRET) upon binding of an analyte, such as Ca²⁺ (U.S. Pat. No. 6,197,928; Miyawaki, A. et al. (1997) Nature 388: 882–87). For example, cameleon comprises a blue or cyan mutant of GFP, calmodulin, the CaM binding domain of myosin light chain kinase, and a green or yellow GFP. Upon binding of Ca²⁺ by the CaM domain, FRET occurs between the two GFPs because of a structural change in the chimera. Thus, FRET intensity is dependent on the Ca²⁺ levels within the cell or organelle (Kerr, R. et al. (2000) Neuron 26: 583–94). Other examples of intracellular biosensors include sensors for detecting changes in cell membrane potential (Siegel, M. et al. (1997) Neuron 19: 735–41; Sakai, R. (2001) Eur. J. Neurosci. 13: 2314–18), monitoring exocytosis (Miesenbock, G. et al. (1997) Proc. Natl. Acad. Sci. USA 94: 3402–07), and measuring intracellular/organellar ATP concentrations via luciferase protein (Kennedy, H. J. et al. (1999) J. Biol. Chem. 274: 13281–91). These biosensors find use in monitoring the effects of various cellular effectors, for example pharmacological agents that modulate ion channel activity, neurotransmitter release, ion fluxes within the cell, and changes in ATP metabolism.
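As a hedged illustration of how a FRET signal that depends on Ca²⁺ levels might be converted to a concentration estimate, the following sketch assumes a simple Hill-type binding curve. This model is an assumption introduced for the example and is not taken from the cited references; the function names and all parameter values (minimum and maximum emission ratios, apparent dissociation constant, and Hill coefficient) are hypothetical and would have to be calibrated for any actual sensor.

```python
def fret_ratio(ca_molar, r_min=1.0, r_max=2.0, kd_molar=1e-6, hill=1.0):
    """Map a Ca2+ concentration (mol/L) to a FRET emission ratio.

    Assumed Hill-type model: the fraction of sensor bound to Ca2+ sets the
    ratio between r_min (no Ca2+) and r_max (saturating Ca2+).
    """
    if ca_molar < 0:
        raise ValueError("concentration must be non-negative")
    bound = ca_molar ** hill / (kd_molar ** hill + ca_molar ** hill)
    return r_min + (r_max - r_min) * bound


def ca_from_ratio(ratio, r_min=1.0, r_max=2.0, kd_molar=1e-6, hill=1.0):
    """Invert the assumed model: estimate Ca2+ (mol/L) from a measured ratio."""
    if not (r_min <= ratio < r_max):
        raise ValueError("ratio outside the invertible range of the model")
    fraction = (ratio - r_min) / (r_max - r_min)
    return kd_molar * (fraction / (1.0 - fraction)) ** (1.0 / hill)


if __name__ == "__main__":
    for ca in (1e-8, 1e-7, 1e-6, 1e-5):
        print(f"{ca:.0e} M -> ratio {fret_ratio(ca):.3f}")
```

Under these assumptions the ratio rises monotonically with Ca²⁺ and is most sensitive near the apparent dissociation constant, which is why a sensor would be chosen so that this constant lies within the expected concentration range of the cell or organelle being monitored.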

Other intracellular biosensors comprise detectable gene products with sequences that are responsive to changes in intracellular signals. These sequences include peptide sequences acting as substrates for protein kinases, peptides with binding regions for second messengers, and protein interaction sequences sensitive to intracellular signaling events (see for example, U.S. Pat. No. 5,958,713 and U.S. Pat. No. 5,925,558). For example, a fusion protein construct comprising a GFP and a protein kinase recognition site allows detecting intracellular protein kinase activity by measuring changes in GFP fluorescence arising from phosphorylation of the fusion construct. Alternatively, the GFP is fused to a protein interaction domain whose interaction with cellular components is altered by cellular signaling events. For example, it is well known that inositol trisphosphate (InsP3) induces release of Ca²⁺ from intracellular stores into the cytoplasm, which results in activation of kinases responsible for regulating various cellular responses. The precursor to InsP3 is phosphatidyl-inositol-4,5-bisphosphate (PtdInsP₂), which is localized in the plasma membrane and cleaved by phospholipase C (PLC) following activation of an appropriate receptor. Many signaling enzymes are sequestered in the plasma membrane through pleckstrin homology domains that bind specifically to PtdInsP₂. Following cleavage of PtdInsP₂, the signaling proteins translocate from the plasma membrane into the cytosol where they activate various cellular pathways. Thus, a reporter molecule such as GFP fused to a pleckstrin domain will act as an intracellular sensor for phospholipase C activation (see Haugh, J. M. et al. (2000) J. Cell. Biol. 15: 1269–80; Jacobs, A. R. et al. (2001) J. Biol. Chem. 276: 40795–802; and Wang, D. S. et al. (1996) Biochem. Biophys. Res. Commun. 225: 420–26). Other similar constructs are useful for monitoring activation of other signaling cascades and are applicable as assays in screens for candidate agents that inhibit or activate particular signaling pathways.

Since protein interaction domains, such as the described pleckstrin homology domain, are important mediators of cellular responses and biochemical processes, other preferred genes of interest are proteins containing protein-interaction domains. By “protein-interaction domain” herein is meant a polypeptide region that interacts with other biomolecules, including other proteins, nucleic acids, lipids, etc. These protein domains frequently act to provide regions that induce formation of specific multiprotein complexes for recruiting and confining proteins to appropriate cellular locations or affect specificity of interaction with target ligands, such as protein kinases and their substrates. Thus, many of these protein domains are found in signaling proteins. Protein-interaction domains comprise modules or micro-domains ranging from about 20 to 150 amino acids that can be expressed in isolation and bind to their physiological partners. Many different interaction domains are known, most of which fall into classes related by sequence or ligand binding properties. Accordingly, the genes of interest comprising interaction domains may comprise proteins that are members of these classes of protein domains and their relevant binding partners. These domains include, among others, SH2 domain (src homology domain 2), SH3 domain (src homology domain 3), PTB domain (phosphotyrosine binding domain), FHA domain (forkhead associated domain), WW domain, 14-3-3 domain, pleckstrin homology domain, C1 domain, C2 domain, FYVE domain (Fab-1, YGL023, Vps27, and EEA1), death domain, death effector domain, caspase recruitment domain, Bcl-2 homology domain, bromo domain, chromatin organization modifier domain, F box domain, hect domain, ring domain (Zn²⁺ finger binding domain), PDZ domain (PSD-95, discs large, and zona occludens domain), sterile α motif domain, ankyrin domain, arm domain (armadillo repeat motif), WD domain and EF-hand (calretinin), PUB domain (Suzuki T. et al. (2001) Biochem. Biophys. Res. Commun. 287: 1083–87), nucleotide binding domain, Y Box binding domain, and H. G. domain, all of which are well known in the art.

Since protein-interaction domains are pervasive in cellular signal transduction cascades and other cellular processes, such as cell cycle regulation and protein degradation, expression of single proteins or multiple proteins with interaction domains acting in a specific signaling or regulatory pathway may provide a basis for inactivating, activating, or modulating such pathways in normal and diseased cells. In another aspect, the preferred embodiments comprise binding partners of these interaction domains, which are well known to those skilled in the art or are identifiable by well known methods (e.g., yeast two hybrid technique, co-precipitation of immune complexes, etc.).

Included within the protein-interaction domains are transcriptional activation domains capable of activating transcription when fused to an appropriate DNA binding domain. Transcriptional activation domains are well known in the art. These include activator domains from GAL4 (amino acids 1–147; Fields, S. et al. (1989) Nature 340: 245–46; Gill, G. et al. (1990) Proc. Natl. Acad. Sci. USA 87: 2127–31), GCN4 (Hope, I. A. et al. (1986) Cell 46: 885–94), ARD1 (Thukral, S. K. et al. (1989) Mol. Cell. Biol. 9: 2360–69), human estrogen receptor (Kumar, V. et al. (1987) Cell 51: 941–51), VP16 (Triezenberg, S. J. et al. (1988) Genes Dev. 2: 718–29), Sp1 (Courey, A. J. (1988) Cell 55: 887–98), AP-2 (Williams, T. et al. (1991) Genes Dev. 5: 670–82), and NF-κB p65 subunit and related Rel proteins (Moore, P. A. et al. (1993) Mol. Cell. Biol. 13: 1666–74). DNA binding domains include, among others, leucine zipper domain, homeo box domain, Zn²⁺ finger domain, paired domain, LIM domain, ETS domain, and T Box domain.

Since the genes of interest may comprise DNA binding domains and transcriptional activation domains, other genes of interest useful for expression in the present invention are transcription factors. Preferred transcription factors are those producing a cellular phenotype when expressed within a particular cell type. Transcription factors as defined herein include both transcriptional activators and inhibitors. As not all cells will respond to expression of a particular transcription factor, those skilled in the art can choose appropriate cell strains in which expression of a transcription factor results in dominant or altered phenotypes as described below.

In another aspect, the transcription factor regulates expression of a different promoter of interest on an expression vector that does not encode the transcription factor. This arrangement requires introducing into a single cell a plurality of vectors, as described below, one of which expresses the transcription factor regulating the different promoter of interest. Expression of the transcription factor is made inducible or the transcription factor itself is an inducible transcription factor, thus allowing further regulation of the different promoter of interest.

In an alternative embodiment, the transcription factor encoded by the gene of interest regulates the promoter on the expression vector encoding the transcription factor. Thus, these constructs are autoregulatory for expression of the fusion nucleic acid (Hofmann, A. (1996) Proc. Natl. Acad. Sci. USA 93: 5185–90). Accordingly, if the transcription factor inhibits the promoter activity on the expression vector, continued synthesis of transcription factor restricts expression of the fusion nucleic acid. On the other hand, if the transcription factor activates transcription, synthesis is elevated because of continued synthesis of the transcriptional activator. Consequently, by use of separation sequences to express a plurality of genes of interest, one of which encodes the transcription factor, the retroviral vector autoregulates expression of the genes of interest. To enhance autoregulation, the transcription factor is an inducible transcription factor, for example a tetracycline or steroid inducible transcription factor (e.g., RU-486 or ecdysone inducible, see White, J. H. (1997) Adv. Pharmacol. 40: 339–67). Incorporation of an inducible transcription factor in a retroviral vector as a single autoregulatory cassette eliminates the need for additional vectors for regulating the promoter activity. Moreover, this system results in rapid, uniform expression of the gene(s) of interest.

In another preferred embodiment, the gene of interest encodes a protein whose expression has a dominant effect on the cell (i.e., produces an altered cellular phenotype). By “dominant effect” herein is meant that the protein or peptide produces an effect upon the cell in which it is expressed, or on another cell not expressing the dominant effect protein, and is detected by the methods described below. The dominant effect may act directly on the cell to produce the phenotype or act indirectly on a second molecule, which leads to a specific phenotype. Dominant effect is produced by introducing into cells small molecule effectors, expressing a single protein, or by expressing multiple proteins acting in combination (e.g., proteins acting synergistically on a cellular pathway or a multisubunit protein effector). As is well known in the art, expression of a variety of genes of interest may produce a dominant effect. Expressed proteins may be mutant proteins that are constitutive for a biological activity (Segouffin-Cariou, C. et al. (2000) J. Biol. Chem. 275: 3568–76; Luo et al. (1997) Mol. Cell. Biol. 17: 1562–71) or are inactive forms that sequester or inhibit activity of normal binding partners (Bossu, P. (2000) Oncogene 19: 2147–54; Mochizuki, H. (2001) Proc. Natl. Acad. Sci. USA 98: 10918–23). The inactive forms as defined herein include expression of small modular protein-interaction regions or other domains that bind to binding partners in the cell (see for example, Gilchrist, A. et al. (1999) J. Biol. Chem. 274: 6610–16). Dominant effects are also produced by overexpression of normal cellular proteins, expression of proteins not normally expressed in a particular cell type, or expression of normally functioning proteins in cells lacking functional proteins due to mutations or deletions (Takihara, Y. et al. (2000) Carcinogenesis 21: 2073–77; Kaplan, J. B. (1994) Oncol. Res. 6: 611–15). Random peptides or biased random peptides introduced into cells can also produce dominant effects. An example of a dominant effect produced by a peptide is that of random peptides that bind to the Src SH3 domain, resulting in increased Src activity. This activation is due to the peptides' antagonistic effect on negative regulation of Src (see Sparks, A. B. et al. (1994) J. Biol. Chem. 269: 23853–56).

As defined herein, dominant effect is not restricted to the effect on the cell expressing the protein. A dominant effect may be on a cell contacting the expressing cell or by secretion of the protein encoded by the gene of interest into the cellular medium. Proteins with dominant effect on other cells are conveniently directed to the plasma membrane or secretion by incorporating appropriate secretion and/or membrane localization signals. These membrane bound or secreted dominant effector proteins may comprise cytokines and chemokines, growth factors, toxins (e.g., neurotoxins), extracellular proteases (e.g., metalloproteases), cell surface receptor ligands (e.g., sevenless type receptor ligands), adhesion proteins (e.g., L1, cadherins, integrins, laminin), etc.

In an alternative embodiment, the gene of interest encodes a conditional gene product. By “conditional gene product” herein is meant a gene product whose activity is only apparent under certain conditions, for example at particular ranges of temperature. Other factors that conditionally affect activity of a protein include, but are not limited to, ion concentration, pH, and light (see Hager, A. (1996) Planta 198: 294–99; Pavelka, J. (2001) Bioelectromagnetics 22: 371–83). A conditional gene product produces a specific cellular phenotype under a restrictive condition. In contrast, the conditional gene product does not produce a specific phenotype under permissive conditions. Methods for making or isolating conditional gene products are well known (see for example White, D. W. et al. (1993) J. Virol. 67: 6876–81; Parini, M. C. (1999) Chem. Biol. 6: 679–87).

As is appreciated by those skilled in the art, conditional gene products are useful in examining genes that are detrimental to a cell's survival or in examining cellular biochemical and regulatory pathways in which the gene product functions. For those gene products that affect cell survival, use of conditional gene products allows survival of the cells under permissive conditions, but results in lethality or detriment at the restrictive condition. This feature allows screens at the restrictive condition for candidate agents, such as proteins and small molecules, that may directly or indirectly suppress the effect of a conditional gene product but permit maintenance and growth of cells under permissive conditions. In addition, conditional gene products are also useful in screens for regulators of cell physiology when the conditional gene product is a participant in a cellular regulatory pathway. At the restrictive condition, the conditional gene product ceases to function or becomes activated, resulting in an altered cell phenotype due to dysregulation of the regulatory pathway. Candidate agents are then screened for their ability to activate or inhibit downstream pathways to bypass the disrupted regulatory point. Conditional gene products are well known in the art and include, among others, proteins such as dynamin involved in endocytic pathway (Damke, H. et al. (1995) Methods Enzymol. 257: 209–20), p53 involved in tumor suppression (Pochampally, R. et al. (2000) Biochem. Biophys. Res. Comm. 279: 1001–10 and Buckbinder, L. et al. (1994) Proc. Natl. Acad. Sci. USA 91: 10640–44), Vac1 involved in vesicle sorting, proteins involved in viral pathogenesis (SV40 Large T Antigen; Robinson, C. C. (1980) J. Virol. 35: 246–48), and gene products involved in regulating the cell cycle, such as ubiquitin conjugating enzyme CDC 34 (Ellison, K. S. et al. (1991) J. Biol. Chem. 266: 24116–20).

In another preferred embodiment, the gene of interest comprises a multiple cloning site (MCS). This allows cassetting in of various genes of interest into the expression vectors. In one preferred embodiment, the MCS lacks nucleotide sequences capable of functioning as a translation initiation site, which allows cloning a gene of interest containing its own translation initiation sequences. Alternatively, the MCS comprises a peptide or protein coding region with its own translation initiation sequence for expressing proteins or peptides lacking a translation initiation sequence. In addition, other nucleic acid sequences that increase expression of the first gene of interest (e.g., Gly or GlyGly following the initiating methionine residue) may be included in the multiple cloning site. The coding region may also comprise an indicator gene, such as lacZ, to permit identification of inserts by insertional inactivation of lacZ. In these constructs, use of a promoter controlling element capable of being active in both eukaryotes and prokaryotes will allow detecting lacZ in prokaryotes during the cloning process (see Wirtz, E. et al. (1995) Science 268: 1179–83). In either case, a separation sequence chosen from a protease based, IRES based, or Type 2A based sequence is operably linked to the multiple cloning site. When at least one of the genes of interest comprises rGFP or pGFP, expression of the fluorescent proteins allows monitoring expression of a gene of interest cloned into the MCS.

In yet another preferred embodiment, the gene of interest comprises candidate bioactive agents comprising candidate nucleic acids, as described below. Thus, a gene of interest may comprise candidate bioactive agents in the form of cDNAs, cDNA fragments, genomic DNA fragments, and nucleic acids encoding random or biased random peptides, as described below. Expression of fusion nucleic acids where the gene of interest is a candidate agent allows selection of cells expressing the candidate agent based on expression of the rGFP or pGFP.

In the present invention, there is no particular order of the first gene of interest and the second gene of interest. When at least one of the genes of interest is rGFP or pGFP, a preferred embodiment may have a gene of interest upstream of the GFP. Another preferred embodiment may have the GFP upstream and the gene of interest downstream. By “upstream” and “downstream” herein is meant the proximity to the point of transcription initiation, which is generally localized 5′ to the coding sequence of the fusion nucleic acid. Thus, in a preferred embodiment, the upstream position is more proximal to the transcription initiation site than the downstream position.

As will be appreciated by those skilled in the art, the positioning of the gene of interest relative to the GFP is determined by the person skilled in the art. Factors to consider include the need for detecting expression of a gene of interest or optimizing the synthesis of a protein of interest. In the embodiments described above, the GFP gene may be placed downstream of the gene of interest so that expression of the GFP will be a faithful indication of expression of the gene of interest. This will depend on the types of separation sites chosen by the person skilled in the art. When protease cleavage or Type 2A separation sequences are incorporated into the fusion nucleic acid, a GFP or other reporter gene situated downstream of the gene of interest will generally provide direct information on expression of the gene of interest. In the case of IRES sequences, however, detecting expression of the GFP or reporter gene to monitor expression of an upstream gene of interest is less direct since separate translation initiations occur for the first gene of interest and the second gene of interest, generally resulting in a lower amount of the second protein being made. In some cases, the ratio of expression of first and second proteins can be as high as 10:1.

The order of the genes of interest on the fusion nucleic acid and the choice of separation sequence are also important when the relative amounts of the gene products are at issue. For example, use of IRES sequences may result in lower amounts of downstream gene product as compared to upstream GFP gene because of differing translation initiation rates. Relative levels of translation initiation are easily determined by comparing expression of upstream gene of interest versus downstream gene of interest. Where controlling expression levels is important, the person skilled in the art will place the gene whose product is needed at higher levels upstream of the other gene when IRES separation sequences are used. Alternatively, multiple copies of IRES sequences are adaptable to increase expression of the downstream gene. On the other hand, use of protease or Type 2A separation sequences will lessen the need for ordering the genes of interest on the fusion nucleic acid since these separation sequences tend to produce equal levels of upstream and downstream gene product.

As will be appreciated by those skilled in the art, various combinations of genes of interest may be used in the fusion nucleic acids of the present invention. In a preferred embodiment, at least one of the genes of interest comprises a rGFP or pGFP gene, or its variants. In one aspect, the rGFP or pGFP protein functions as a reporter protein for monitoring expression of the gene of interest. For example, if the gene of interest is a nucleic acid encoding a dominant effect protein, a candidate agent comprising cDNA, or a candidate nucleic acid encoding a random peptide, expression of rGFP or pGFP provides a basis for selecting cells expressing the gene of interest and for monitoring their expression levels. In another aspect, expression of the rGFP or pGFP along with a gene of interest comprising another reporter or selection gene allows for increased discrimination for selecting cells expressing the fusion nucleic acid. This increased selectivity is desirable when measuring promoter activity, for example when screening for candidate agents affecting promoter activity.

In another preferred embodiment, at least one of the genes of interest comprises a fusion nucleic acid encoding a rGFP or pGFP fusion protein. In one aspect, the rGFP or pGFP is fused to a cDNA, genomic DNA, or nucleic acid encoding a random peptide. That is, the rGFP or pGFP fusion protein comprises candidate agents, as described below. In these constructs, a gene of interest may comprise a distinguishable reporter gene to monitor expression of the rGFP or pGFP fusion protein. In another aspect, the gene of interest may comprise a dominant effect protein, a cell cycle gene, or a conditional gene product that produces a specific cellular phenotype. This allows identification of candidate agents expressed by at least one of the genes of interest (i.e., the rGFP or pGFP fused to cDNA, genomic DNA or random peptides) that alter the cellular phenotype produced by another gene of interest. In another aspect, the gene of interest may comprise a cellular biosensor, which allows analysis of cell physiological events induced by expression of a separate rGFP or pGFP fusion protein.

When the vectors are used to express separate protein products encoded by the genes of interest, the fusion nucleic acids further comprise separation sequences. By a “separation sequence” or “separation site” or grammatical equivalents as used herein is meant a sequence that results in protein products not linked by a peptide bond. Separation may occur at the RNA or protein level. Being separate does not preclude the possibility that the protein products of the first gene of interest and the second gene of interest interact either non-covalently or covalently following their synthesis. Thus, the separate protein products may interact through hydrophobic domains, protein-interaction domains, common bound ligands, or through formation of disulfide linkages between the proteins.

Various types of separation sequences may be employed. In one preferred embodiment, the separation sequence encodes a recognition site for a protease. A protease recognizing the site cleaves the translated protein product into two or more proteins. Preferred protease cleavage sites and cognate proteases include, but are not limited to, prosequences of retroviral proteases including human immunodeficiency virus protease, and sequences recognized and cleaved by trypsin (EP 578472; Takasuga, A. et al. (1992) J. Biochem. 112: 652–57), proteases encoded by Picornaviruses (Ryan, M. D. et al. (1997) J. Gen. Virol. 78: 699–723), factor X_(a) (Gardella, T. J. et al. (1990) J. Biol. Chem. 265: 15854–59; WO 9006370), collagenase (J03280893; WO 9006370; Tajima, S. et al. (1991) J. Ferment. Bioeng. 72: 362), clostripain (EP 578472), subtilisin (including mutant H64A subtilisin, Forsberg, G. et al. (1991) J. Protein Chem. 10: 517–26), chymosin, yeast KEX2 protease (Bourbonnais, Y. et al. (1988) J. Biol. Chem. 263: 15342–47), thrombin (Forsberg et al., supra; Abath, F. G. et al. (1991) BioTechniques 10: 178), Staphylococcus aureus V8 protease or similar endoproteinase-Glu-C to cleave after Glu residues (EP 578472; Ishizaki, J. et al. (1992) Appl. Microbiol. Biotechnol. 36: 483–86), cleavage by NIa proteinase of tobacco etch virus (Parks, T. D. et al. (1994) Anal. Biochem. 216: 413–17), endoproteinase-Lys-C (U.S. Pat. No. 4,414,332) and endoproteinase-Asp-N, Neisseria type 2 IgA protease (Pohlner, J. et al. (1992) Biotechnology 10: 799–804), soluble yeast endoproteinase yscF (EP 467839), chymotrypsin (Altman, J. D. et al. (1991) Protein Eng. 4: 593–600), enteropeptidase (WO 9006370), lysostaphin, a polyglycine specific endoproteinase (EP 316748), the family of caspases (e.g., caspase 1, caspase 2, caspase 3, etc.), and metalloproteases.

The present invention also contemplates protease recognition sites identified from genomic DNA, cDNA, or random nucleic acid libraries (see for example, O'Boyle, D. R. et al. (1997) Virology 236: 338–47). For example, the fusion nucleic acids of the present invention may comprise a separation site which is a randomizing region for the display of candidate protease recognition sites. The first and second genes of interest encode reporter molecules useful for detecting protease activity, such as rGFP or pGFP capable of undergoing FRET with other fluorescent proteins via linkage through a candidate recognition site (see Mitra, R. D. et al. (1996) Gene 173: 13–17). Proteases are expressed or introduced into cells expressing these fusion nucleic acids. Random peptide sequences acting as substrates for the particular protease result in separate GFP proteins when acted on by a protease, thus producing a loss of FRET signal. By identifying classes of protease recognition sites, optimal or novel protease recognition sequences may be determined.

In addition to their use in producing separate proteins of interest, the protease cleavage sites and the cognate proteases are also useful in screening for candidate agents that enhance or inhibit protease activity. Since many proteases are crucial to pathogenesis of organisms or cellular regulation, for example the HIV or caspase proteases, the ability to express reporter or selection proteins linked by a protease cleavage site allows screens for therapeutic agents directed against a particular protease.

Another preferred embodiment of separation sequences are internal ribosome entry sites (IRES). By “internal ribosome entry sites”, “internal ribosome binding sites”, or “IRES elements”, or grammatical equivalents herein is meant sequences that allow CAP independent initiation of translation (Kim, D. G. et al. (1992) Mol. Cell. Biol. 12: 3636–43; McBratney, S. et al. (1993) Curr. Opin. Cell Biol. 5: 961–65). IRES sequences appear to act by recruiting 40S ribosomal subunit to the mRNA in the absence of translation initiation factors required for normal CAP dependent translation initiation. IRES sequences are heterogeneous in nucleotide sequence, RNA structure, and factor requirements for ribosome binding. They are frequently located on the untranslated leader regions of RNA viruses, such as the Picornaviruses. The viral sequences range from about 450 to 500 nucleotides in length, although IRES sequences may also be shorter or longer (Adam, M. A. et al. (1991) J. Virol. 65: 4985–90; Borman, A. M. et al. (1997) Nucleic Acids Res. 25: 925–32; Hellen, C. U. et al. (1995) Curr. Top. Microbiol. Immunol. 203: 31–63; Mountford, P. S. et al. (1995) Trends Genet. 11: 179–84). Embodiments of viral IRES separation sites are the Type I IRES sequences present in entero- and rhinoviruses and Type II sequences of cardioviruses and aphthoviruses (e.g., encephalomyocarditis virus; see Elroy-Stein, O. et al. (1989) Proc. Natl. Acad. Sci. USA 86: 6126–30; Alexander, L. et al. (1994) Proc. Natl. Acad. Sci. USA 91: 1406–10). Other viral IRES sequences are found in hepatitis A viruses (Brown, E. A. et al. (1994) J. Virol. 68: 1066–74), avian reticuloendotheliosis virus (Lopez-Lastra, M. et al. (1997) Hum. Gene Ther. 8: 1855–65), Moloney murine leukemia virus (Vagner, S. et al. (1995) J. Biol. Chem. 270: 20376–83), short IRES segments of hepatitis C virus (Urabe, M. et al. (1997) Gene 200: 157–62), and DNA viruses (e.g., Kaposi's sarcoma-associated virus, Bieleski, L. et al. (2001) J. Virol. 75: 1864–69).

Additionally, preferred embodiments of IRES sequences are non-viral IRES elements found in a variety of organisms including yeast, insects, birds and mammals. Like the viral IRES sequences, cellular IRES sequences are heterogeneous in sequence and secondary structure. Cellular IRES sequences, however, may comprise shorter nucleic acid sequences as compared to viral IRES elements (Oh, S. K. et al. (1992) Genes Dev. 6: 1643–53; Chappell, S. A. et al. (2000) 97: 1536–41). Specific IRES sequences include, but are not limited to, those involved in expression of immunoglobulin heavy chain binding protein, transcription factors, protein kinases, protein phosphatases, eIF4G (see Johannes, G. et al. (1999) Proc. Natl. Acad. Sci. USA 96: 13118–23; Johannes, G. et al. (1998) RNA 4: 1500–13), vascular endothelial growth factor (Huez, I. et al. (1989) Mol. Cell. Biol. 18: 6178–90), c-myc (Stoneley, M. et al. (2000) Nucleic Acids Res. 28: 687–94), apoptotic protein Apaf-1 (Coldwell, M. J. et al. (2000) Oncogene 19: 899–905), DAP-5 (Henis-Korenblit, S. et al. (2000) Mol. Cell. Biol. 20: 496–506), connexin (Werner, R. (2000) IUBMB Life 50: 173–76), Notch-2 (Lauring, S. A. et al. (2000) Mol. Cell 6: 939–45), and fibroblast growth factor (Creancier, L. et al. (2000) J. Cell Biol. 150: 275–81). As some IRES sequences act or function efficiently in particular cell types, the person skilled in the art will choose IRES elements with relevance to particular cells being used to express the fusion nucleic acid. Moreover, multiple IRES sequences in various combinations, either homomultimeric or heteromultimeric arrangements constructed as tandem repeats or connected via linkers, are useful for increasing efficiency of translation initiation of the genes of interest. The combinations of IRES elements comprise at least 2 to 10 or more copies or combinations of IRES sequences, depending on the efficiency of initiation desired.

In addition to their use as separation sequences, IRES elements serve as targets for therapeutic agents since IRES sequences mediate expression of proteins involved in viral pathogenesis or cellular disease states. Thus, the present invention is applicable in screens for candidate agents that inhibit IRES mediated translation initiation events. In these constructs, the rGFP or pGFP may serve as a reporter of IRES mediated translation or may comprise the candidate agent being screened (e.g., when expressed as a fusion protein with cDNAs or random peptides).

Another preferred embodiment of IRES elements are sequences in nucleic acid or random nucleic acid libraries that function as IRES elements. Screens for these IRES type sequences can employ fusion nucleic acids containing bicistronically arranged genes of interest encoding reporter genes or selection genes, or combinations thereof. Genomic, cDNA, or random nucleic acid sequences are inserted between the two reporter or selection genes. After introducing the nucleic acid construct into cells, for example by retroviral delivery, the cells are screened for expression of the downstream gene mediated by functional IRES sequences. Selection is based on expression of selection gene or reporter gene (e.g., FACS analysis for expression of a downstream rGFP or pGFP gene). The upstream gene of interest serves to permit monitoring expression of the fusion nucleic acid. The length of the nucleic acids screened is preferably 6 to 100 nucleotides, although longer nucleic acids may be used.

The present invention further contemplates use of enhancers of IRES mediated translation initiation. IRES initiated translation may be enhanced by any number of methods. Cellular expression of virally encoded proteases, which cleave eIF4F to remove CAP-binding activity from the 40S ribosome complexes, may be employed to increase preference for IRES translation initiation events. These proteases are found in some Picornaviruses and can be expressed in a cell by introducing the viral protease gene by transfection or retroviral delivery (Roberts, L. O. (1998) RNA 4: 520–29). Other enhancers adaptable for use with IRES elements include cis-acting elements, such as 3′ untranslated region of hepatitis C virus (Ito, T. et al. (1998) J. Virol. 72: 8789–96) and polyA segments (Bergamini, G. et al. (2000) RNA 6: 1781–90), which may be included as part of the fusion nucleic acid of the present invention. In addition, preferential use of cellular IRES sequences may occur when CAP dependent mechanisms are impaired, for example by dephosphorylation of 4E-BP, proteolytic cleavage of eIF4G, or when cells are placed under stress by γ-irradiation, amino acid starvation, or hypoxia. Thus, in addition to the methods described above, IRES enhancing procedures include activation or introduction of 4E-BP targeted phosphatases or proteases of eIF4G. Alternatively, the cells are subjected to stress conditions described above. Other trans-acting IRES enhancers include heterogeneous nuclear ribonucleoprotein (hnRNP) (Kaminski, A. et al. (1998) RNA 4: 626–38), PTB hnRNP E2/PCBP2 (Walter, B. L. et al. (1999) RNA 5: 1570–85), La autoantigen (Meerovitch, K. et al. (1993) J. Virol. 67: 3798–3807), unr (Hunt, S. L. et al. (1999) Genes Dev. 13: 437–48), ITAF45/Mpp1 (Pilipenko, E. V. et al. (2000) Genes Dev. 14: 2028–45), DAP5/NAT1/p97 (Henis-Korenblit, S. et al. (2000) Mol. Cell. Biol. 20: 496–506), and nucleolin (Izumi, R. E. et al. (2001) Virus Res. 76: 17–29). These factors may be introduced into a cell either alone or in combination. Accordingly, various combinations of IRES elements and enhancing factors are used to effect a separation reaction.

In another preferred embodiment, the separation sites are Type 2A separation sequences. By “Type 2A” sequences herein is meant nucleic acid sequences that when translated inhibit formation of peptide linkages. Type 2A sequences are distinguished from IRES sequences in that 2A sequences do not involve CAP independent translation initiation. Without being bound by theory, Type 2A sequences appear to act by disrupting peptide bond formation between the nascent polypeptide chain and the incoming activated tRNA^(PRO) (Donnelly, M. L. et al. (2001) J. Gen. Virol. 82: 1013–25). Although the peptide bond fails to form, the ribosome continues to translate the remainder of the RNA to produce separate peptides unlinked at the carboxy terminus of the 2A peptide region. An advantage of Type 2A separation sequences is that near stoichiometric amounts of first protein of interest and second protein of interest are made as compared to IRES elements. Moreover, Type 2A sequences do not appear to require additional factors, such as proteases that are required to effect separation when using protease recognition sites.

Preferred Type 2A separation sequences are those found in cardioviral and aphthoviral genomes. These sequences are approximately 21 amino acids long and have the general sequence XXXXXXXXXXLXXDXEXNPGP (SEQ ID NO: 23), where X is any amino acid. Disruption of peptide bond formation occurs between the carboxy terminal glycine (G) and proline (P), the final two residues of this sequence. These 2A sequences are found in the aphthovirus Foot and Mouth Disease Virus (FMDV), cardiovirus Theiler's murine encephalomyelitis virus (TME), and encephalomyocarditis virus (EMC). Various viral Type 2A sequences are shown in FIG. 9 (SEQ ID NOS: 12–23). The 2A sequences function in a wide range of eukaryotic expression systems, thus allowing their use in a variety of cells and organisms. Accordingly, inserting these 2A separation sequences in between the nucleic acids encoding the first gene of interest and second gene of interest, as more fully explained below, will lead to expression of separate protein products of the first gene of interest and the second gene of interest.
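
By way of illustration only, the general sequence given above can be expressed as a pattern and used to predict where a translated fusion would separate. The following Python sketch is not part of the disclosed methods; the function names and the test polypeptide are hypothetical, and X is assumed to mean any single uppercase one-letter amino acid code.

```python
import re

# General Type 2A sequence from the text: ten arbitrary residues followed by
# L-X-X-D-X-E-X-N-P-G-P (21 residues in total). X = any residue (assumption:
# one-letter uppercase codes).
TYPE_2A_PATTERN = re.compile(r"[A-Z]{10}L[A-Z]{2}D[A-Z]E[A-Z]NPGP")


def find_2a_site(polypeptide):
    """Return (start, end) of the first match to the general 2A sequence, or None."""
    match = TYPE_2A_PATTERN.search(polypeptide)
    if match is None:
        return None
    return match.start(), match.end()


def split_at_2a(polypeptide):
    """Predict the two products of a fusion containing one 2A sequence.

    Separation is taken to occur between the final glycine and proline, so the
    upstream product ends in ...NPG and the downstream product begins with P.
    Returns None when no 2A-like sequence is found.
    """
    site = find_2a_site(polypeptide)
    if site is None:
        return None
    _, end = site
    return polypeptide[: end - 1], polypeptide[end - 1 :]


if __name__ == "__main__":
    # Hypothetical polypeptide: a short upstream stub, an artificial sequence
    # fitting the general pattern, and a short downstream stub.
    artificial_2a = "AAAAAAAAAA" + "LAADAEANPGP"
    fusion = "MKT" + artificial_2a + "MVSK"
    print(split_at_2a(fusion))
```

A match to the general sequence indicates only that a candidate fits the pattern; whether it has separating activity would still be determined experimentally.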

In another embodiment, the present invention contemplates mutated versions or variants of Type 2A sequences. By “mutated” or “variant” or grammatical equivalents herein is meant deletions, insertions, transitions, transversions of nucleic acid sequences that exhibit the same qualitative separating activity as displayed by the naturally occurring analogue, although preferred mutants or variants have more efficient separating activity and efficient translation of the downstream gene of interest. Mutant variants include changes in nucleic acid sequence that do not change the corresponding 2A amino acid sequence, but incorporate frequently used codons (i.e., codon optimized) to allow efficient translation of the 2A region (see Zolotukhin, S. et al. (1996) J. Virol. 70: 4646–54). In another aspect, the mutant variants are changes in nucleic acid sequence that change the corresponding 2A amino acid sequence. Thus, one embodiment of a variant 2A sequence is a short deletion of the 21 amino acid 2A sequence that retains separating activity. The deletion may comprise removal of about 3 to 6 amino acids at the amino terminus of the 2A region. In another embodiment, Type 2A sequences are mutated by methods well known in the art, such as chemical mutagenesis, oligonucleotide directed mutagenesis, and error prone replication. Mutants with altered separating activity are readily identified by examining expression of the fusion nucleic acids of the present invention. Assaying for production of a separate downstream gene product, such as a reporter protein or a selection protein, allows for identifying sequences having separating activity. Another method for identifying variants may use a FRET based assay using linked GFP molecules, as described above. Insertion of variant 2A sequences in place of or adjacent to the gly-ser linker region, or other suitable regions linking the GFPs, will allow detection of functional 2A separation sequences by identifying constructs that produce separated GFP molecules, as measured by loss of FRET signal. Sequences having no or reduced separating activity will retain higher levels of FRET signal due to physical linkage of the GFP molecules. This strategy will permit high throughput analysis of variants and allows selection of sequences having high efficiency Type 2A separating activity.
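
As a purely illustrative sketch of how readings from such a FRET based screen might be ranked, the Python fragment below scores each variant by the fraction of FRET signal lost relative to a construct in which the GFP molecules remain linked. The function names, the variant names, the readings, and the 0.5 cutoff are all hypothetical assumptions and are not taken from the present disclosure.

```python
def fret_loss(signal, linked_control):
    """Fraction of FRET signal lost relative to a fully linked control construct."""
    if linked_control <= 0:
        raise ValueError("linked_control must be positive")
    return 1.0 - signal / linked_control


def rank_separating_variants(signals, linked_control, min_loss=0.5):
    """Return (variant, loss) pairs whose FRET loss meets min_loss, highest loss first.

    signals maps a variant name to its measured FRET signal; min_loss is an
    arbitrary illustrative cutoff, not a value given in the text.
    """
    scored = [(name, fret_loss(value, linked_control)) for name, value in signals.items()]
    hits = [(name, loss) for name, loss in scored if loss >= min_loss]
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical readings in arbitrary units.
    readings = {"wild_type_2A": 10.0, "delta3": 40.0, "point_mutant": 95.0}
    for name, loss in rank_separating_variants(readings, linked_control=100.0):
        print(name, round(loss, 2))
```

Variants retaining most of the FRET signal fall below the cutoff and are not reported, consistent with the statement that sequences having no or reduced separating activity retain higher levels of FRET signal.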

In yet another embodiment, Type 2A separation sequences include homologs present in other nucleic acids, including nucleic acids of other viruses, bacteria, yeast, and multicellular organisms such as worms, insects, birds, and mammals. Homology in this context means sequence similarity or identity. A variety of sequence based alignment methodologies, which are well known to those skilled in the art, are useful in identifying homologous sequences. These include, but are not limited to, the local homology algorithm of Smith, T. F. and Waterman, M. S. (1981) Adv. Appl. Math. 2: 482–89, homology alignment algorithm of Pearson, W. R. and Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85: 2444–48, Basic Local Alignment Search Tool (BLAST) described by Altschul, S. F. et al. (1990) J. Mol. Biol. 215: 403–10, or the Best Fit program described by Devereux, J. et al. (1984) Nucleic Acids Res. 12: 387–95, and the FastA and TFASTA alignment programs, preferably using default settings or by inspection.

In one preferred embodiment, similarity or identity for any nucleic acid or protein outlined herein is calculated by Fast alignment algorithms based upon the following parameters: mismatch penalty of 1.0; gap size penalty of 0.33; and joining penalty of 30 (see “Current Methods in Sequence Comparison and Analysis” in Macromolecule Sequencing and Synthesis: Selected Methods and Applications, p. 127–149, Alan R. Liss, Inc., 1998). Another example of a useful algorithm is PILEUP. PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. PILEUP uses a simplification of the progressive alignment method of Feng, D. F. and Doolittle, R. F. (1987) J. Mol. Evol. 25: 351–60, which is similar to the method described by Higgins, D. G. and Sharp, P. M. (1989) CABIOS 5: 151–3. Useful parameters include a default gap weight of 3.00, a default gap length weight of 0.10, and weighted end gaps.

Another example of a useful algorithm is the family of BLAST alignment tools initially described by Altschul et al. (see also Karlin, S. et al. (1993) Proc. Natl. Acad. Sci. USA 90: 5873–87). A particularly useful BLAST program is the WU-BLAST-2 program described in Altschul, S. F. et al. (1996) Methods Enzymol. 266: 460–80. WU-BLAST-2 uses several search parameters, most of which are set to default values. The adjustable parameters are set with the following values: overlap span=1, overlap fraction=0.125, word threshold (T)=11. The HSP S and HSP S2 parameters are dynamic values and are established by the program itself depending upon the composition of the particular sequence and the composition of the particular database against which the sequence of interest is being searched; however, the values may be adjusted to increase sensitivity. A % amino acid sequence identity value is determined by the number of matching identical residues divided by the total number of residues of the longer sequence in the aligned region. The “longer” sequence is the one having the most actual residues in the aligned region (gaps introduced by WU-BLAST-2 to maximize the alignment score are ignored).
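By way of illustration only, the percent identity calculation described above may be sketched as follows. The function name, the use of “-” as the gap character, and the representation of the aligned region as two equal-length strings are assumptions made for this sketch and are not part of WU-BLAST-2 itself.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over an aligned region.

    Matching identical residues are divided by the number of actual
    residues (gaps ignored) in the longer of the two aligned sequences.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    # Count columns where both sequences carry the same residue (not a gap).
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )

    # The "longer" sequence is the one with the most actual residues.
    residues_a = sum(1 for x in aligned_a if x != gap)
    residues_b = sum(1 for y in aligned_b if y != gap)
    longer = max(residues_a, residues_b)
    if longer == 0:
        raise ValueError("aligned region contains no residues")

    return 100.0 * matches / longer


if __name__ == "__main__":
    # Hypothetical aligned region: 5 identical residues, longer sequence has 7.
    print(round(percent_identity("ACDE-GH", "ACDEFGK"), 1))
```

In this hypothetical example the gap introduced into the first sequence is not counted as a residue, so the denominator is the seven residues of the second sequence.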

In a similar manner, “percent (%) nucleic acid sequence identity” with respect to the coding sequence of the polypeptide described herein is defined as the percentage of the nucleotide residues in a candidate sequence that are identical with the nucleotide residues in the coding sequence of the Type 2A regions. A preferred method utilizes the BLASTN module of WU-BLAST-2 set to the default parameters, with overlap span and overlap fraction set to 1 and 0.125, respectively.

An additional useful algorithm is gapped BLAST as reported by Altschul, S. F. et al. (1997) Nucleic Acids Res. 25: 3389–402. Gapped BLAST uses BLOSUM-62 substitution scores; a threshold parameter set to 9; the two-hit method to trigger ungapped extensions; charges gap lengths of k at a cost of 10+k; Xu set to 16; and Xg set to 40 for the database search stage and to 67 for the output stage of the algorithms. Gapped alignments are triggered by a score corresponding to about 22 bits.

The alignment may include the introduction of gaps in the sequence to be aligned. In addition, for sequences which contain either more or fewer amino acids than the Type 2A sequences in FIG. 3, it is understood that the percentage of homology will be determined based on the number of homologous amino acids in relation to the total number of amino acids. Thus, Type 2A sequences may be shorter or longer than the amino acid sequence shown in FIG. 3.

Another embodiment of Type 2A separating sequences are those sequences present in libraries of nucleic acids, including genomic DNA or cDNA, that have Type 2A separating activity. By Type 2A separating activity herein is meant a nucleic acid which encodes an amino acid sequence that exhibits similar separating activity as the naturally occurring Type 2A sequences. Segments of nucleic acids are inserted between the first gene of interest and second gene of interest in the fusion nucleic acids of the present invention and examined for separating activity as described above. The preferred lengths to be tested are nucleic acids encoding peptides 5 to 50 amino acids or larger, with a more preferred range of peptides 10–30 amino acids long.

Embodiments of Type 2A sequences also encompass random nucleic acid libraries encoding peptides that have Type 2A separating activity. In these embodiments, the separation site represents a randomizing region where random or biased random nucleic acids encoding random or biased random peptides are inserted between the first gene of interest and second gene of interest. The preferred lengths of the random nucleic acids are nucleic acids encoding peptides 5 to 50 amino acids, with a more preferred range of peptides 10–30 amino acids. Random peptides having separating activity are identified using the above described assays. Identification of functional separation sequences will permit additional searches for related sequences having Type 2A like separating activity, either through homology searches, mutagenesis screens, or by use of biased random peptide sequences. Sequences with separating activity can then be used to express separate proteins of interest according to the present invention.

In a preferred embodiment, the genes of interest are linked to a fusion partner to form a fusion polypeptide as described above. In a preferred embodiment, combinations of fusion partners are used, with or without linkers.

As will be appreciated by those skilled in the art, the fusion nucleic acids of the present invention are not limited to a fusion nucleic acid comprising only a promoter, first gene of interest, separation sequence, and a second gene of interest. Any number of separation sequences and genes of interest may be used in the fusion nucleic acid. Additional separation sequences may be chosen from protease based, IRES based, or Type 2A based separating sequences and added to the fusion nucleic acids along with additional genes of interest. Consequently, a preferred embodiment further comprises a plurality of separating sequences and genes of interest. Thus, in one aspect, the fusion nucleic acids comprise a second separating sequence and a third gene of interest, and may further comprise a third separating sequence and a fourth gene of interest. As will be appreciated by those skilled in the art, by inserting additional separating sequences and additional genes of interest, any number of proteins encoded by the genes of interest may be separately expressed. Additional separating sequences and genes of interest may be desired in screening methods where the first and second gene of interest encode reporter proteins whose activities are affected by a third gene of interest or where expression of more than two genes of interest is necessary to produce a cellular phenotype.

The nucleic acids and the fusion nucleic acids described herein can be prepared using standard recombinant DNA techniques described in, for example, Sambrook, J. et al., Molecular Cloning: A Laboratory Manual, 2nd edition, Cold Spring Harbor Press, Cold Spring Harbor, N.Y., 1989, and Ausubel, F. et al., Current Protocols in Molecular Biology, Greene Publishing Associates and John Wiley & Sons, New York, N.Y., 1994. Generally, the expression vectors also contain the required regulatory or control sequences (e.g., promoters and promoter controlling elements, translation initiation and termination sequences, polyadenylation sequences, splicing signals, etc.), cloning and subcloning sites, reporter/selection or marker genes for identifying cells containing the fusion nucleic acid, and priming regions for sequencing, polymerase chain reaction, or library synthesis, and the like. As described above, these nucleic acid sequences are operably linked such that the components of the resulting fusion nucleic acids are placed in a functional relationship with each other. That is, the components described are placed in a relationship permitting them to function in their intended manner.

When the fusion nucleic acids contain separation sequences, constructing the fusion nucleic acid will depend in part on the separation sequence employed. The separation sequence is operably linked to the first gene of interest and second gene of interest such that the fusion nucleic acid is capable of producing separate protein products of interest. Thus, in a preferred embodiment, the separation sequence is placed in between the first gene of interest and the second gene of interest. As will be appreciated by those skilled in the art, use of separation sequences based on protease recognition or Type 2A sequences requires that the fusion nucleic acid comprising the first gene of interest, separation sequence, and second gene of interest be in frame. By “in frame” herein is meant that the fusion nucleic acid encodes a continuous single polypeptide comprising the protein encoded by the first gene of interest, protein encoded by the separation sequence, and protein encoded by the second gene of interest. Standard recombinant DNA techniques may be used for placing the components of the fusion nucleic acid to encode a contiguous single polypeptide. Linkers may be added to the separation sequence to facilitate the separation reactions or limit structural interference of the separation sequence on the genes of interest. Preferred linkers are (Gly)_(n) linkers, where n is 1 or more, typically two, three, four, five, or six, although linkers of 7–10 or more amino acids are possible.
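By way of illustration only, the “in frame” requirement may be checked computationally along the following lines. The function names, the use of the standard stop codons TAA, TAG, and TGA, and the example sequences are assumptions made for this sketch; the sketch treats the second gene of interest as ending with its own stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed genetic code)


def codons(seq: str) -> list:
    """Split a nucleotide sequence into complete triplets."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def is_in_frame(first_gene: str, separation: str, second_gene: str) -> bool:
    """Return True if the three parts read as one continuous polypeptide.

    Each part must consist of whole codons, and no stop codon may occur
    before the final codon of the second gene of interest.
    """
    parts = [first_gene.upper(), separation.upper(), second_gene.upper()]
    if any(len(p) % 3 for p in parts):
        return False  # a partial codon would shift the downstream frame

    upstream = codons(parts[0] + parts[1])
    if any(c in STOP_CODONS for c in upstream):
        return False  # translation would terminate before the second gene

    # The last codon of the second gene may be a stop; the rest may not.
    body = codons(parts[2])[:-1]
    return not any(c in STOP_CODONS for c in body)


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from the specification.
    print(is_in_frame("ATGGCC", "GGAGGA", "ATGAAATAA"))
    print(is_in_frame("ATGGCCTAA", "GGAGGA", "ATGAAATAA"))
```

The first hypothetical construct passes the check, while the second fails because the first gene of interest retains its stop codon ahead of the separation sequence.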

As is appreciated by those skilled in the art, use of IRES sequences does not require the first gene of interest, separation sequence, and second gene of interest to be in frame, since IRES sequences function as internal translation initiation sites. Accordingly, fusion nucleic acids using IRES elements have the genes of interest arranged in a cistronic structure. That is, transcription of the fusion nucleic acid produces a cistronic mRNA that encodes both the first gene of interest and the second gene of interest, with the IRES element controlling translation initiation of the downstream gene of interest. Alternatively, separate IRES sequences may control the upstream and downstream gene of interest.

Nucleic acids for making libraries of the fusion nucleic acids comprising genomic DNA or cDNA as described herein are made by methods well known in the art. The libraries may also be directed to a specific set of encoded protein sequences, such as protein interaction domains. These may be synthesized using standard oligonucleotide synthesis methods, by using libraries of cloned nucleic acids, or by multiplex PCR of nucleic acids encoding the desired polypeptide domains.

When the nucleic acids comprise libraries of random nucleic acid sequences or random encoded peptides, these nucleic acids are preferably synthesized using known oligonucleotide synthesis techniques. These techniques include synthetic methods well known in the art and include, among others, phosphoramidite, phosphoramidate, and phosphonate chemistries (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, IRL Press, Oxford University Press, 1991). Synthesis is controlled such that nucleic acids are totally random or biased random, as more fully described below.

Cells and cellular libraries comprising the fusion nucleic acids of the present invention are generated by introducing the fusion nucleic acids into a plurality of cells. By a “plurality of cells” herein is meant at least two cells, with at least 10³ being preferred, at least about 10⁶ being particularly preferred, and at least about 10⁸ to 10⁹ being especially preferred. This plurality of cells may comprise a cellular library, wherein generally each cell within the library contains a member of the library, for example different random nucleic acids, cDNAs or cDNA fragments, genomic DNA, and combinations thereof. As will be appreciated by those skilled in the art, some cells within the library may not contain a member of the library, and some may contain more than one. When methods other than retroviral infection, such as electroporation or transfection, are used to introduce the fusion nucleic acids into a plurality of cells, the distribution of candidate nucleic acids within the individual members of the cellular library may vary widely, as it is generally difficult to control the number of nucleic acids which are introduced into a cell.
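By way of illustration only, the proportion of cells containing no library member, exactly one member, or more than one member may be estimated as sketched below. The sketch assumes that members are distributed among cells according to a Poisson distribution with a given mean number of members per cell; this statistical model, the function names, and the example mean are assumptions of the sketch and are not stated in the specification.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def library_occupancy(mean_per_cell: float) -> dict:
    """Estimate the fractions of cells with 0, 1, or >1 library members.

    Assumes (for illustration) Poisson-distributed uptake with the
    given mean number of library members per cell.
    """
    if mean_per_cell < 0:
        raise ValueError("mean_per_cell must be non-negative")
    none = poisson_pmf(0, mean_per_cell)
    one = poisson_pmf(1, mean_per_cell)
    return {
        "none": none,
        "exactly_one": one,
        "more_than_one": 1.0 - none - one,
    }


if __name__ == "__main__":
    # Hypothetical mean of 0.5 library members per cell.
    for label, fraction in library_occupancy(0.5).items():
        print(f"{label}: {fraction:.3f}")
```

Under this assumed model, lowering the mean number of members per cell reduces the fraction of cells carrying more than one member at the cost of more cells carrying none.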

The fusion nucleic acids are introduced into cells for expressing the fusion polypeptides and for screening, as is more fully described below. By “introduced into” or grammatical equivalents herein is meant that the nucleic acids enter the cells in a manner suitable for subsequent expression of the nucleic acid. The method of introduction is largely dictated by the targeted cell type. Exemplary methods include CaPO₄ precipitation, dextran sulfate transfection, liposome fusion, Lipofectin®, electroporation, biolistic particle bombardment, microinjection, viral infection, etc. The person skilled in the art can choose the appropriate method of introduction based on the cells and the form of the nucleic acid being introduced. As many pharmaceutically important screens require human or model mammalian cell targets, retroviral vectors capable of transfecting such targets are preferred.

In a preferred embodiment, the vectors are retroviral vectors. Preferred retroviral vectors include a vector based on the murine stem cell virus (MSCV) (see Hawley, R. G. et al. (1994) Gene Ther. 1: 136–38), a modified MFG virus (Riviere, I. et al. (1995) Proc. Natl. Acad. Sci. USA 92: 6733–37), and pBABE. Other suitable vectors include, among others, the LRCX retroviral vector set; pSIR retroviral vector; pLEGFP-N1 retroviral vector; pLAPSN retroviral vector; pLXIN retroviral vector; and pLXSN retroviral vector; all of which are commercially available (e.g., Clontech, Palo Alto, Calif.). When target cells are non-proliferating (e.g., brain cells), useful viral vectors are derived from lentiviruses (Miyoshi, H. et al. (1998) J. Virol. 72: 8150–57), adenoviruses (Zheng, C. et al. (2000) Nat. Biotechnol. 18: 176–80) or alphaviruses (Ehrengruber, M. U. (1999) Proc. Natl. Acad. Sci. USA 96: 7041–46).

Preferably, the fusion nucleic acids and the library of fusion nucleic acids or candidate agents are first cloned into a viral shuttle vector to produce a library of plasmids. A typical shuttle vector is pLNCX (Clontech, Palo Alto, Calif.). The resulting plasmid library can be amplified in E. coli, purified, and introduced into retroviral packaging cell lines. Suitable retroviral packaging cell lines include, but are not limited to, the Bing and BOSC23 cell lines (WO 94/19478; Soneoka, Y. et al. (1995) Nucleic Acids Res. 23: 628–33; Finer, M. H. et al. (1994) Blood 83: 43–50); Phoenix packaging lines such as PhiNX-ampho; 293T + gag-pol and retrovirus envelope; PA317; and other cell lines outlined in Markowitz, D. et al. (1988) Virology 167: 400–06 (see also Markowitz, D. et al. (1988) J. Virol. 62: 1120–24; Li, K. J. et al. (1996) Proc. Natl. Acad. Sci. USA 93: 11658–63; and Kinsella, T. M. et al. (1996) Hum. Gene Ther. 7: 1405–13).

In a preferred embodiment, viruses are made by transient transfection of the cell lines referenced above. The resulting viruses can either be used directly or be used to infect another retroviral cell line for expansion of the library.

In a preferred embodiment, the library of virus particles is used to transfect packaging cell lines disclosed herein to produce a primary viral library. By “primary viral library” herein is meant a library of virus particles comprising the fusion nucleic acids of the present invention. The production of the primary library is preferably done under conditions known in the art to reduce clone bias. The resulting primary viral library can be titred and stored, used directly to infect a target host cell line, or be used to infect another retroviral producer cell for “expansion” of the library.

Concentration of virus may be done as follows. Generally, retroviruses are titred by applying retrovirus containing supernatant onto indicator cells, such as NIH3T3 cells, and then measuring the percentage of cells expressing phenotypic consequences of infection. The concentration of virus is determined by multiplying the percentage of cells infected by the dilution factor involved, and taking into account the number of target cells available to obtain relative titre. If the retrovirus contains a reporter gene, such as lacZ, then infection, integration, and expression of the recombinant virus is measured by histological staining for lacZ expression or by flow cytometry (i.e., FACS analysis). In general, retroviral titres generated from even the best of the producer cells do not exceed 10⁷ per ml unless concentrated, for example by centrifugation and ultrafiltration. However, flow-through transduction methods can provide up to a ten-fold higher infectivity by infecting cells on a porous membrane and allowing retrovirus supernatant to flow past the cells. This provides the capability of generating retroviral titres higher than those achieved by concentration (see Chuck, A. S. (1996) Hum. Gene Ther. 7: 743–50).
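By way of illustration only, the relative titre calculation described above may be sketched as follows. The function name, the expression of the result per ml of supernatant applied, and the example values are assumptions made for this sketch.

```python
def relative_titre(
    fraction_infected: float,
    target_cells: float,
    dilution_factor: float,
    volume_ml: float,
) -> float:
    """Estimate relative titre in infectious units per ml.

    The fraction of indicator cells infected is multiplied by the
    number of target cells available and by the dilution factor, then
    divided by the volume of diluted supernatant applied (assumed in ml).
    """
    if not 0.0 <= fraction_infected <= 1.0:
        raise ValueError("fraction_infected must be between 0 and 1")
    if target_cells <= 0 or dilution_factor <= 0 or volume_ml <= 0:
        raise ValueError("cells, dilution, and volume must be positive")
    return fraction_infected * target_cells * dilution_factor / volume_ml


if __name__ == "__main__":
    # Hypothetical assay: 20% of 1e5 indicator cells infected by 1 ml
    # of supernatant diluted ten-fold.
    print(f"{relative_titre(0.20, 1e5, 10, 1.0):.2e} per ml")
```

The estimate is most meaningful at dilutions where only a minority of indicator cells are infected, since at high percentages a single cell may have been infected more than once.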

To obtain the secondary viral library, host cells are preferably infected with a multiplicity of infection (MOI) of 10. By “secondary viral library” herein is meant a library of retroviral particles expressing the claimed fusion nucleic acids and candidate agents described herein.

As will be appreciated by those in the art, the viral libraries described above are used to produce the cellular libraries of the present invention. The types of cells used in the present invention can vary widely. Basically any mammalian cells may be used, including preferred cell types from mouse, rat, primate, and human cells. As is more fully described below, cell types implicated in a wide variety of disease conditions are particularly useful, so long as a suitable screen may be designed to allow the selection of cells that exhibit an altered phenotype as a consequence of treating the cells with candidate agents. As will be appreciated by those in the art, modifications of the system by pseudotyping allow all eukaryotic cells to be used, preferably higher eukaryotes (Morgan, R. A. et al. (1993) J. Virol. 67: 4712–21; Yang, Y. et al. (1995) Hum. Gene Ther. 6: 1203–13).

The fusion nucleic acids are introduced into a host cell and treated under the appropriate conditions to induce or cause expression of the fusion protein. As described above, various expression vectors may be made for introducing the fusion nucleic acids into a variety of organisms, including prokaryotic and eukaryotic. Appropriate host cells include bacteria, archaebacteria, yeast, fungi, worms, plants, insect cells, and animal cells, including fish and mammalian cells. For example, bacterial host cells include Bacillus subtilis, Escherichia coli, Streptococcus cremoris, Streptomyces lividans, Haemophilus influenzae, etc. Yeast cells include Saccharomyces cerevisiae, Candida albicans, Candida maltosa, Hansenula polymorpha, Kluyveromyces fragilis, Kluyveromyces lactis, Pichia guilliermondii, Schizosaccharomyces pombe, and Yarrowia lipolytica. Appropriate insect cells include Lepidoptera cell lines, such as Spodoptera frugiperda (e.g. Sf9) or Trichoplusia ni. However, those skilled in the art will recognize the applicability of other insect cell systems, such as the silkworm Bombyx mori, Drosophila cells (Schneider 2, KC, BG2-C6, and Shi), A. albopictus, A. aegypti, Choristoneura fumiferana, Heliothis virescens, Heliothis zea, Orgyia pseudotsugata, Lymantria dispar, Plutella xylostella, Malacosoma disstria, Pieris rapae, Mamestra configurata, and Hyalophora cecropia. In another preferred embodiment, live insects are used to express the proteins of the present invention. Larvae are the preferred form for expressing the desired product, including the larvae of Manduca sexta, Bombyx mori, Drosophila, and the like, which are susceptible to infection by recombinant insect viruses.

In a preferred embodiment, the fusion nucleic acids are expressed in mammalian cells. Basically, any mammalian cells may be used, with mouse, rat, primate and human cells being particularly preferred, as will be appreciated by those in the art. When retroviral vectors are used, preferred are mammalian cells in which the library of retroviral vectors is made.

In a preferred embodiment, cell types implicated in a wide variety of disease conditions are particularly useful when screens, as described below, are designed for selecting cells that exhibit an altered phenotype as a consequence of expression of the gene of interest, for example a random peptide, within the cell. Accordingly, suitable cell types include, but are not limited to, tumor cells of all types (particularly melanoma, myeloid leukemia, carcinomas of the lung, breast, ovaries, colon, kidney, prostate, pancreas and testes), cardiomyocytes, endothelial cells, epithelial cells, lymphocytes (T-cell and B-cell), mast cells, eosinophils, vascular intimal cells, hepatocytes, leukocytes including mononuclear leukocytes, stem cells such as haemopoietic, neural, skin, lung, kidney, liver and myocyte stem cells (for use in screening for differentiation and de-differentiation factors), osteoclasts, chondrocytes and other connective tissue cells, keratinocytes, melanocytes, liver cells, kidney cells, and adipocytes. Suitable cells also include known research cells, including, but not limited to, Jurkat T cells, NIH3T3 cells, CHO, Cos, etc. (see the ATCC cell line catalog, hereby expressly incorporated by reference).

To provide those skilled in the art the tools to use the present invention, the nucleic acids and cells of the present invention are assembled into kits. The components included in the kits may comprise the fusion nucleic acids (e.g., expression vectors or libraries), enzymatic reagents for making the fusion nucleic acid constructs, cells for packaging and amplification of viruses, and reagents for transfection and transduction into target cells. Alternatively, the kits contain libraries of fusion nucleic acids capable of being introduced into cells and/or contain cells already stably expressing the fusion nucleic acids (e.g., via integration of the retroviruses into the cellular chromosome).

In the present invention, the fusion nucleic acids and cells comprising the fusion nucleic acids of the present invention find use in screens for candidate agents producing an altered cellular phenotype. By “candidate agent” or “candidate small molecules” or “candidate expression products” herein is meant an agent or expression product which may be tested for the ability to alter the phenotype of a cell.

Candidate bioactive agents encompass numerous chemical classes, though typically they are organic molecules, preferably small organic compounds having a molecular weight of more than 100 and less than about 2,500 daltons. Candidate agents comprise functional groups necessary for structural interaction with proteins, particularly hydrogen bonding, and typically include at least an amine, carbonyl, hydroxyl, or carboxyl group, preferably at least two of the functional chemical groups. The candidate agents often comprise cyclical carbon or heterocyclic structures, and/or aromatic or polyaromatic structures substituted with one or more of the above functional groups. Candidate agents are also found among biomolecules including peptides, saccharides, fatty acids, steroids, purines, pyrimidines, and their derivatives, structural analogs or combinations thereof. Particularly preferred are proteins, candidate drugs, and other small molecules.

Candidate agents are obtained from a wide variety of sources, including libraries of synthetic or natural compounds. For example, numerous means are available for random and directed synthesis of a wide variety of organic compounds and biomolecules, including expression of randomized oligonucleotides (see for example, Gallop, M. A. et al. (1994) J. Med. Chem. 37: 1233–51; Gordon, E. M. et al. (1994) J. Med. Chem. 37: 1385–401; Thompson, L. A. et al. (1996) Chem. Rev. 96: 555–600; Balkenhol, F. et al. (1996) Angew. Chem. Int. Ed. 35: 2288–337; and Gordon, E. M. et al. (1996) Acc. Chem. Res. 29: 444–54). Alternatively, libraries of natural compounds in the form of bacterial, fungal, plant, and animal extracts are available or readily produced. Additionally, natural or synthetically produced libraries and compounds are readily modified through conventional chemical, physical, and biochemical means. Known pharmacological agents may be subjected to directed or random chemical modifications such as acylation, alkylation, esterification, and amidification to produce structural analogs.

The candidate agent can be pesticides, insecticides or environmental toxins; a chemical (including solvents, polymers, organic molecules, etc.); therapeutic molecules (including therapeutic and abused drugs, antibiotics, etc.); biomolecules (including hormones, cytokines, proteins, lipids, carbohydrates, cellular membrane antigens and receptors (neural, hormonal, nutrient, and cell surface receptors) or their ligands, etc.); whole cells (including prokaryotic and eukaryotic cells, such as pathogenic cells and mammalian tumor cells); viruses (including retroviruses, herpes viruses, adenoviruses, lentiviruses, etc.); and spores (e.g., fungal, bacterial, etc.).

In a preferred embodiment, the candidate agents are proteins. By “protein” herein is meant at least two covalently attached amino acids, which includes proteins, polypeptides, oligopeptides and peptides. The protein may be made up of naturally occurring amino acids and peptide bonds, or synthetic peptidomimetic structures. Thus, “amino acid” or “peptide residue” as used herein means both naturally occurring and synthetic amino acids. For example, homo-phenylalanine, citrulline, and norleucine are considered amino acids for the purposes of the invention. “Amino acids” also includes imino residues such as proline and hydroxyproline. The side chains may be in either the (R) or (S) configuration. In the preferred embodiment, the amino acids are in the (S) or L configuration. If non-naturally occurring side chains are used, non-amino acid substituents may be used, for example to prevent or retard in vivo degradation. Proteins including non-naturally occurring amino acids may be synthesized or, in some cases, made by recombinant techniques (see van Hest, J. C. et al. (1998) FEBS Lett. 428: 68–70 and Tang et al. (1999) Abstr. Pap. Am. Chem. S218: U138–U138 Part 2, both of which are expressly incorporated by reference herein).

In a preferred embodiment, the candidate bioactive agents are naturally occurring proteins or fragments of naturally occurring proteins. Thus, for example, cellular extracts containing proteins, or random or directed digests of proteinaceous cellular extracts, may be used. In this way, libraries of prokaryotic and eukaryotic proteins may be made for screening in the systems described herein. Particularly preferred in this embodiment are libraries of bacterial, fungal, viral, and mammalian proteins, with the latter being preferred, and human proteins being especially preferred.

Candidate agents may encompass a variety of peptidic agents. These include, but are not limited to, (1) immunoglobulins, particularly IgEs, IgGs, and IgMs, and particularly therapeutically or diagnostically relevant antibodies, including but not limited to, for example, antibodies to human albumin, apolipoproteins (including apolipoprotein E), human chorionic gonadotropin, cortisol, α-fetoprotein, thyroxin, thyroid stimulating hormone (TSH), antithrombin, antibodies to pharmaceuticals (including antiepileptic drugs (phenytoin, primidone, carbamazepine, ethosuximide, valproic acid, and phenobarbital), cardioactive drugs (digoxin, lidocaine, procainamide, and disopyramide), bronchodilators (theophylline), antibiotics (chloramphenicol, sulfonamides), antidepressants, immunosuppressants, abused drugs (amphetamine, methamphetamine, cannabinoids, cocaine and opiates)) and antibodies to any number of viruses (including orthomyxoviruses (e.g. influenza virus), paramyxoviruses (e.g. respiratory syncytial virus, mumps virus, measles virus), adenoviruses, rhinoviruses, coronaviruses, reoviruses, togaviruses (e.g. rubella virus), parvoviruses, poxviruses (e.g. variola virus, vaccinia virus), enteroviruses (e.g. poliovirus, coxsackievirus), hepatitis viruses (including A, B and C), herpesviruses (e.g. Herpes simplex virus, varicella-zoster virus, cytomegalovirus, Epstein-Barr virus), rotaviruses, Norwalk viruses, hantavirus, arenavirus, rhabdovirus (e.g. rabies virus), retroviruses (including HIV, HTLV-I and -II), papovaviruses (e.g. papillomavirus), polyomaviruses, and picornaviruses, and the like), and bacteria (including a wide variety of pathogenic and non-pathogenic prokaryotes of interest including Bacillus; Vibrio, e.g. V. cholerae; Escherichia, e.g. enterotoxigenic E. coli; Shigella, e.g. S. dysenteriae; Salmonella, e.g. S. typhi; Mycobacterium, e.g. M. tuberculosis, M. leprae; Clostridium, e.g. C. botulinum, C. tetani, C. difficile, C. perfringens; Corynebacterium, e.g. C. diphtheriae; Streptococcus, e.g. S. pyogenes, S. pneumoniae; Staphylococcus, e.g. S. aureus; Haemophilus, e.g. H. influenzae; Neisseria, e.g. N. meningitidis, N. gonorrhoeae; Yersinia, e.g. G. lamblia Y. pestis; Pseudomonas, e.g. P. aeruginosa, P. putida; Chlamydia, e.g. C. trachomatis; Bordetella, e.g. B. pertussis; Treponema, e.g. T. pallidum; and the like); (2) enzymes (and other proteins), including but not limited to, enzymes used as indicators of or treatment for heart disease, including creatine kinase, lactate dehydrogenase, aspartate amino transferase, troponin T, myoglobin, fibrinogen, cholesterol, triglycerides, thrombin, tissue plasminogen activator (tPA); pancreatic disease indicators including amylase, lipase, chymotrypsin and trypsin; liver function enzymes and proteins including cholinesterase, bilirubin, and alkaline phosphatase; aldolase, prostatic acid phosphatase, terminal deoxynucleotidyl transferase, and bacterial and viral enzymes such as HIV protease; (3) hormones and cytokines (many of which serve as ligands for cellular receptors) such as erythropoietin (EPO), thrombopoietin (TPO), the interleukins (including IL-1 through IL-17), insulin, insulin-like growth factors (including IGF-1 and -2), epidermal growth factor (EGF), transforming growth factors (including TGF-α and TGF-β), human growth hormone, transferrin, low density lipoprotein, high density lipoprotein, leptin, VEGF, PDGF, ciliary neurotrophic factor, prolactin, adrenocorticotropic hormone (ACTH), calcitonin, human chorionic gonadotropin, cortisol, estradiol, follicle stimulating hormone (FSH), thyroid-stimulating hormone (TSH), luteinizing hormone (LH), progesterone, testosterone; and (4) other proteins (including α-fetoprotein and carcinoembryonic antigen (CEA)).

In a preferred embodiment, the candidate bioactive agents are peptides of from about 5 to about 30 amino acids, with from about 5 to about 20 amino acids being preferred, and from about 7 to about 15 being particularly preferred. These peptides may be digests of naturally occurring proteins, as described above, or random peptides or “biased” random peptides, and peptide analogs either chemically synthesized or encoded by candidate nucleic acids. By “randomized” or grammatical equivalents herein is meant that each nucleic acid and peptide consists of essentially random nucleotides and amino acids, respectively. Generally, since these random peptides (or nucleic acids, discussed below) are chemically synthesized, they may incorporate any amino acid or nucleotide at any position. The synthetic process can be designed to generate randomized proteins or nucleic acids to allow the formation of all or most of the possible combinations over the length of the sequence, thus forming a library of randomized candidate bioactive proteinaceous agents.

In one preferred embodiment, the library is fully randomized, with no sequence preference or constants at any position. In another preferred embodiment, the library is biased. That is, some positions within the sequence are either held constant or are selected from a limited number of possibilities. For example, in a preferred embodiment, the nucleotides or amino acid residues are randomized within a defined class, for example hydrophobic amino acids, hydrophilic residues, sterically biased (either small or large) residues, or are amino acid residues for crosslinking (i.e. cysteines) or phosphorylation sites (i.e. serines, threonines, tyrosines, or histidines).
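By way of illustration only, fully randomized and biased peptide libraries of the kind described above may be modelled computationally as sketched below. The function names, the membership of the hydrophobic class, the fixed random seed, and the example template are assumptions made for this sketch and do not represent a synthesis protocol.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
HYDROPHOBIC = "AVLIMFW"  # illustrative class membership (an assumption)


def biased_peptide(template, rng):
    """Draw one peptide from a per-position template.

    Each template entry is a string of residues allowed at that
    position: a single letter holds the position constant, a short
    string restricts it to a class, and AMINO_ACIDS leaves it random.
    """
    return "".join(rng.choice(allowed) for allowed in template)


def make_library(template, size, seed=0):
    """Generate `size` peptides from the template, reproducibly."""
    rng = random.Random(seed)
    return [biased_peptide(template, rng) for _ in range(size)]


if __name__ == "__main__":
    # Hypothetical biased design: cysteines held constant at both ends
    # for crosslinking, one hydrophobic position, the rest fully random.
    template = ["C"] + [AMINO_ACIDS] * 3 + [HYDROPHOBIC] + [AMINO_ACIDS] * 3 + ["C"]
    for peptide in make_library(template, size=5):
        print(peptide)
```

A fully randomized library corresponds to a template in which every position is set to all twenty amino acids.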

In a preferred embodiment, the bias is toward peptides or nucleic acids that interact with known classes of molecules. For example, it is known that much of intracellular signaling is carried out by short regions of polypeptide interacting with other polypeptide regions of other proteins, such as the interaction domains described above. Another example of an interaction domain is a short region from the HIV-1 envelope cytoplasmic domain that has been previously shown to block the action of cellular calmodulin. Regions of the Fas cytoplasmic domain, which show homology to the mastoparan toxin from wasps, can be limited to a short peptide region with death inducing apoptotic or G protein inducing functions. Magainin, a natural peptide derived from Xenopus, can have potent anti-tumor and anti-microbial activity. Short peptide fragments of a protein kinase C isozyme (β-PKC) have been shown to block nuclear translocation of PKC in Xenopus oocytes following stimulation. In addition, short SH-3 target proteins have been used as pseudosubstrates for specific binding to SH-3 proteins. This is of course a short list of available peptides with biological activity, as the literature is dense in this area. Thus, there is much precedent for the potential of small peptides to have activity on intracellular signaling cascades. In addition, agonists and antagonists of any number of molecules may be used as the basis of biased randomization of candidate bioactive agents as well.

Thus, a number of molecules or protein domains are suitable as starting points for generating biased candidate agents. A large number of small molecule domains are known that confer common function, structure or affinity. These include protein-protein interaction domains and nucleic acid interaction domains described above. As is appreciated by those in the art, while variations of these protein-protein or protein-nucleic acid domains may have weak amino acid homology, the variants may have strong structural homology.

In another preferred embodiment, the candidate agents are nucleic acids. By “nucleic acid” or “oligonucleotide” or grammatical equivalents herein is meant at least two nucleotides covalently linked together. A nucleic acid of the present invention will generally contain phosphodiester bonds, although in some cases, as outlined below, nucleic acid analogs are included that may have alternate backbones, comprising, for example, phosphoramide (Beaucage, S. L. et al. (1993) Tetrahedron 49: 1925–63 and references therein; Letsinger, R. L. et al. (1970) J. Org. Chem. 35: 3800–03; Sprinzl, M. et al. (1977) Eur. J. Biochem. 81: 579–89; Letsinger, R. L. et al. (1986) Nucleic Acids Res. 14: 3487–99; Sawai et al. (1984) Chem. Lett. 805; Letsinger, R. L. et al. (1988) J. Am. Chem. Soc. 110: 4470; and Pauwels et al. (1986) Chemica Scripta 26: 141–49), phosphorothioate (Mag, M. et al. (1991) Nucleic Acids Res. 19: 1437–41; and U.S. Pat. No. 5,644,048), phosphorodithioate (Briu et al. (1989) J. Am. Chem. Soc. 111: 2321), O-methylphosphoroamidite linkages (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, Oxford University Press, 1991), and peptide nucleic acid backbones and linkages (Egholm, M. (1992) J. Am. Chem. Soc. 114: 1895–97; Meier et al. (1992) Angew. Chem. Int. Ed. Engl. 31: 1008; Egholm, M. (1993) Nature 365: 566–68; Carlsson, C. et al. (1996) Nature 380: 207, all of which are incorporated by reference). Other analog nucleic acids include those with positive backbones (Dempcy, R. O. et al. (1995) Proc. Natl. Acad. Sci. USA 92: 6097–101); non-ionic backbones (U.S. Pat. Nos. 5,386,023, 5,637,684, 5,602,240, 5,216,141 and 4,469,863; Kiedrowski et al. (1991) Angew. Chem. Intl. Ed. English 30: 423; Letsinger, R. L. et al. (1988) J. Am. Chem. Soc. 110: 4470; Letsinger, R. L. et al. (1994) Nucleoside & Nucleotide 13: 1597; Chapters 2 and 3, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook; Mesmaeker et al. (1994) Bioorganic & Medicinal Chem. Lett. 4: 395; Jeffs et al. (1994) J. Biomolecular NMR 34: 17; (1996) Tetrahedron Lett. 37: 743) and non-ribose backbones, including those described in U.S. Pat. Nos. 5,235,033 and 5,034,506, and Chapters 6 and 7, ACS Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghvi and P. Dan Cook. Nucleic acids containing one or more carbocyclic sugars are also included within the definition of nucleic acids (see Jenkins et al. (1995) Chem. Soc. Rev. 169–76). Several nucleic acid analogs are described in Rawls, C & E News Jun. 2, 1997 page 35. All of these references are hereby expressly incorporated by reference. These modifications of the ribose-phosphate backbone may be done to facilitate the addition of additional moieties, such as labels, or to increase the stability and half-life of such molecules in physiological environments.

In addition, mixtures of different nucleic acid analogs, and mixtures of naturally occurring nucleic acids and analogs may be made. The nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded and single stranded sequence. The nucleic acid may be DNA, both genomic and cDNA, RNA or a hybrid, where the nucleic acid contains any combination of deoxyribo- and ribonucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, xanthine, hypoxanthine, isocytosine, isoguanine, etc., although generally occurring bases are preferred.

In a preferred embodiment, the candidate nucleic acids comprise cDNAs, including cDNA libraries, or fragments of cDNAs. The cDNAs can be derived from any number of different cells and include cDNAs generated from eucaryotic and procaryotic cells, viruses, cells infected with viruses or other pathogens, genetically altered cells, cells with defective cellular processes, etc. Preferred embodiments include cDNAs made from different individuals, such as different patients, particularly human patients. The cDNAs may be complete libraries or partial libraries. Furthermore, the candidate nucleic acids can be derived from a single cDNA source or multiple sources; that is, cDNA from multiple cell types or multiple individuals or multiple pathogens can be combined in a screen. In other aspects, the cDNA may encode specific domains, such as signaling domains, protein interaction domains, membrane binding domains, targeting domains, etc. The cDNAs may utilize entire cDNA constructs or fractionated constructs, including random or targeted fractionation. Suitable fractionation techniques include enzymatic (e.g., DNase I, restriction nucleases, etc.), chemical, or mechanical fractionation (e.g., sonication or shearing). Also useful for the present invention are cDNA libraries enriched for a specific class of proteins, such as type I membrane proteins (Tashiro, K. et al. (1993) Science 261: 600–03) and membrane proteins (Kopczynski, C. C. (1998) Proc. Natl. Acad. Sci. USA 95: 9973–78). Additionally useful are subtracted cDNA libraries, in which genes preferentially or exclusively expressed in particular cells, tissues, or developmental phases are enriched. Methods for making subtracted cDNA libraries are well known in the art (see Diatchenko, L. et al. (1999) Methods Enzymol. 303: 349–80; von Stein, O. D. et al. (1997) Nucleic Acids Res. 13: 2598–602; Carninci, P. (2000) Genome Res. 10: 1431–32). Accordingly, a cDNA library may be a complete cDNA library from a cell, a partial library, an enriched library from one or more cell types, or a constructed library with certain cDNAs being removed to form a library.

In another preferred embodiment, the candidate nucleic acids comprise genomic nucleic acids, including organellar nucleic acids. As elaborated above for cDNAs, the genomic nucleic acids may be derived from any number of different cells, including genomic nucleic acids of eukaryotes, prokaryotes, or viruses. They may be from normal cells or cells defective in cellular processes, such as tumor suppression, cell cycle control, or cell surface adhesion. Moreover, the genomic nucleic acids may be obtained from cells infected with pathogenic organisms, for example cells infected with viruses or bacteria. The genomic nucleic acids comprise entire genomic nucleic acid constructs or fractionated constructs, including random or targeted fractionation as described above. Generally, for genomic nucleic acids and cDNAs, the candidate nucleic acids may range in length from nucleic acids capable of encoding proteins of twenty to thousands of amino acid residues, with from about 50–1000 being preferred and from about 100–500 being especially preferred. In addition, candidate agents comprising cDNA or genomic nucleic acids may also be subsequently mutated using known techniques (e.g., exposure to mutagens, error prone PCR, error prone transcription, combinatorial splicing (e.g., cre-lox recombination)) to generate novel nucleic acid sequences (or protein sequences). In this way libraries of procaryotic and eukaryotic nucleic acids may be made for screening in the systems described herein. Particularly preferred in the embodiments are libraries of bacterial, fungal, viral and mammalian nucleic acids, with the latter being preferred, and human nucleic acids being especially preferred.

In another preferred embodiment, the candidate nucleic acids comprise libraries of random nucleic acids. Generally, the random nucleic acids are fully randomized or they are biased in their randomization, e.g., in nucleotide/residue frequency generally or per position. As defined above, by “randomized” or grammatical equivalents herein is meant that each nucleic acid consists essentially of random nucleotides. Since the candidate nucleic acids are chemically synthesized, they may incorporate any nucleotide at any position. In the expressed random nucleic acid, at least 10, preferably at least 12, more preferably at least 15, most preferably at least 21 nucleotide positions need to be randomized. The candidate nucleic acids may also comprise nucleic acid analogs as described above.

For candidate nucleic acids encoding peptides, the candidate nucleic acids generally contain cloning sites which are placed to allow in-frame expression of the randomized peptides, and any fusion partners, if present, such as presentation structures and the GFPs of the present invention. For example, when presentation structures are used, the presentation structure will generally contain the initiating ATG as part of the parent vector. For candidate agents comprising RNAs, in addition to chemically synthesized RNA nucleic acids, the candidate nucleic acids may be expressed from vectors, including retroviral vectors. Thus, when the RNAs are expressed, vectors expressing the candidate nucleic acids are generally constructed with an internal promoter (e.g., CMV promoter), tRNA promoter, cell specific promoter, or hybrid promoters designed for immediate and appropriate expression of the RNA structure at the initiation site of RNA synthesis. For retroviral vectors, the RNA may be expressed anti-sense to the direction of retroviral synthesis and is terminated as known, for example with an orientation specific terminator sequence. Interference from native viral promoter initiated transcription may be minimized in the target cell by using the SIN vectors described herein.

When the nucleic acids are expressed in the cells, they may or may not encode a protein as described herein. Thus, included within candidate nucleic acids of the present invention are RNAs capable of producing an altered phenotype. Thus, in one aspect, the nucleic acid may be an antisense nucleic acid directed towards a complementary target nucleic acid. As is well known in the art, antisense nucleic acids find use in suppressing or affecting expression of various genes of pathogenic organisms or expression of cellular genes. These uses include suppression of oncogenes to affect the proliferative properties of transformed cells (Martiat, P. et al. (1993) Blood 81: 502–09; Daniel, R. (1995) Oncogene 10: 1607–14; and Niemeyer, C. C. (1998) Cell Death Differ. 5: 440–49), modulation of the cell cycle (Skotz, M. et al. (1995) Cancer Res. 55: 5493–98), inhibition of proteins involved in cardiovascular disease states (Wang, H. (1999) Circ. Res. 85: 614–22) and inhibition of viral pathogenesis (Lo, K. M. et al. (1992) Virology 190: 176–83; and Chatterjee S. et al. (1992) Science 258: 1485–88).

In another preferred embodiment, the candidate nucleic acids are nucleic acids capable of catalyzing cleavage of target nucleic acids in a sequence specific manner, preferably in the form of ribozymes. Ribozymes include among others hammerhead ribozymes, hairpin ribozymes, and hepatitis delta virus ribozymes (Tuschl, T. (1995) Curr. Opin. Struct. Biol. 5: 296–302; Usman N. (1996) Curr. Opin. Struct. Biol. 6: 527–33; Chowrira B. M. et al. (1991) Biochemistry 30: 8518–22; and Perrotta A. T. et al. (1992) Biochemistry 31: 16–21). As with antisense nucleic acids, nucleic acids catalyzing cleavage of target nucleic acids may be directed to a variety of expressed nucleic acids, including those from pathogenic organisms or cellular genes (see for example, Jackson, W. H. et al. (1998) Biochem. Biophys. Res. Commun. 245: 81–84).

In another preferred embodiment, the candidate nucleic acids are double stranded RNAs capable of inducing RNA interference or RNAi (Bosher, J. M. et al. (2000) Nat. Cell Biol. 2: E31–36). Introducing double stranded RNA can trigger specific degradation of homologous RNA sequences, generally within the region of identity of the dsRNA (Zamore, P. D. et al. (2000) Cell 101: 25–33). This provides a basis for silencing expression of genes, thus permitting a method for altering the phenotype of cells. The dsRNA may comprise synthetic RNA made either by known chemical synthetic methods or by in vitro transcription of nucleic acid templates carrying promoters (e.g., T7 or SP6 promoters). Alternatively, the dsRNAs are expressed in vivo, preferably by use of palindromic fusion nucleic acids that allow facile formation of dsRNA (e.g., in the form of a hairpin) when expressed in the cell.
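The palindromic arrangement that permits hairpin formation can be sketched at the sequence level. The following is a minimal illustration, not a construct design from the specification: the function names are hypothetical, the loop sequence is an arbitrary placeholder, and the sketch works on the DNA template rather than on the transcribed RNA. The template places a sense segment, a loop, and the reverse complement of the sense segment in series, so that the two arms of the transcript are complementary.

```python
# Watson-Crick complements for DNA template bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def hairpin_template(sense, loop="TTCAAGAGA"):
    """Build sense + loop + antisense; the loop is an illustrative placeholder."""
    return sense.upper() + loop.upper() + reverse_complement(sense)


if __name__ == "__main__":
    # Hypothetical 12-nucleotide target segment, used only for display.
    print(hairpin_template("GATTACAGGCTA"))
```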

In a preferred embodiment, a library of candidate bioactive agents is used. These include libraries of small molecules, nucleic acids, peptides, cDNAs, genomic nucleic acids, etc. In a preferred embodiment, for candidate agents comprising random nucleic acids and peptides, the library should provide a sufficiently structurally diverse population of randomized expression products to effect a probabilistically sufficient range to provide one or more peptide products which have the desired properties, such as binding to protein interaction domains or producing a desired cellular response. Accordingly, a library must be large enough so that at least one of its members will have a structure that gives it affinity for some molecule, protein or other factor whose activity is involved in some cellular response, such as signal transduction. Although it is difficult to gauge the required absolute size of an interaction library, nature provides a hint with the immune response: a diversity of 10⁷–10⁸ different antibodies provides at least one combination with sufficient affinity to interact with most potential antigens faced by an organism. Published in vitro selection techniques have also shown that a library size of about 10⁷ to 10⁸ is sufficient to find structures with affinity for the target. A library of all combinations of a peptide 7–20 amino acids in length, such as proposed here for expression in retroviruses, has the potential to code for 20⁷ (approximately 10⁹) to 20²⁰ peptides. Thus with libraries of 10⁷ to 10⁸ per ml of retroviral particles, the present methods allow a “working” subset of a theoretically complete interaction library for 7 amino acids, and a subset of shapes for the 20²⁰ library. Thus, in a preferred embodiment, at least 10⁶, preferably at least 10⁷, more preferably at least 10⁸ and most preferably at least 10⁹ different expression products are simultaneously analyzed in the subject methods. Preferred methods maximize library size and diversity.
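The library-size figures above follow from simple arithmetic, which the short sketch below reproduces. The function names are hypothetical, and the coverage value is an upper bound that assumes every library member is a distinct sequence, which a real library would only approximate. With 20 amino acids, a 7-mer has 20⁷ = 1.28 × 10⁹ possible sequences and a 20-mer has 20²⁰, roughly 1.05 × 10²⁶, so a library of 10⁸ members can sample at most about 7.8% of the 7-mer space.

```python
def peptide_space(length, alphabet_size=20):
    """Number of distinct peptides of the given length."""
    return alphabet_size ** length


def max_coverage(library_size, length):
    """Upper bound on the fraction of sequence space a library can sample,
    assuming all library members are distinct (an idealization)."""
    return min(1.0, library_size / peptide_space(length))


if __name__ == "__main__":
    for length in (7, 20):
        print(f"{length}-mers: {peptide_space(length):.3e} possible sequences")
    for size in (10**7, 10**8):
        print(f"library of {size:.0e}: at most "
              f"{max_coverage(size, 7):.2%} of all 7-mers")
```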

The candidate bioactive agents are combined or added to a cell or population of cells or plurality of cells. By “population of cells” or “plurality of cells” herein is meant at least two cells, with at least about 10⁵ being preferred, at least about 10⁶ being particularly preferred, and at least about 10⁷, 10⁸, and 10⁹ being especially preferred.

The candidate agents and the cells are combined. As will be appreciated by those in the art, this may be accomplished in any number of ways, including adding the candidate agents to the surface of the cells, to the media containing the cells, or to a surface on which the cells grow or contact; or adding the agents into the cells, for example by using a vector that will introduce the agents into the cells, especially when the agents are nucleic acids or proteins.

In a preferred embodiment, the candidate agents are either nucleic acids or proteins that are introduced into the cells to screen for candidate agents capable of altering the phenotype of a cell. By “introduced into” or grammatical equivalents herein is meant that the nucleic acids enter the cells in a manner suitable for subsequent expression of the nucleic acid or protein. The method of introduction is largely dictated by the targeted cell type. Known methods include CaPO₄ transfection, DEAE dextran transfection, liposome fusion, Lipofectin®, electroporation, viral infection, biolistic particle bombardment, etc. The candidate nucleic acids may exist either transiently or stably in the cytoplasm or stably integrate into the genome of the host cell (e.g., by retroviral integration or homologous recombination). When mammalian cells are used, retroviral vectors capable of transfecting such targets are preferred.

In a preferred embodiment, the candidate bioactive agents are either nucleic acids or proteins (proteins in this context includes proteins, oligopeptides, and peptides) that are expressed in the host cells using vectors, including viral vectors. The choice of the vector will depend on the cell type. For example, when cells are replicating mammalian cells, retroviral vectors are used. When the cells are non-replicating mammalian cells, for example when arrested in one of the growth phases, viral vectors capable of infecting non-dividing cells, including lentiviral and adenoviral vectors, are used to express the nucleic acids and proteins.

In a preferred embodiment, the candidate bioactive agents are either nucleic acids or proteins that are introduced into the host cells using retroviral vectors, as is generally outlined in PCT US97/01019 and PCT US97/01048, both of which are expressly incorporated by reference. Generally, a library is generated using a retroviral vector backbone; standard oligonucleotide synthesis is done to generate either the candidate agent or nucleic acid encoding a protein, for example a random peptide, using techniques well known in the art. After generating the nucleic acid library, the library is cloned into a first primer, which serves as a cassette for insertion into the retroviral construct. The first primer generally contains additional elements, including for example, the required regulatory sequences (e.g. translation, transcription, promoters, etc.), fusion partners, restriction endonuclease sites, stop codons, and regions of complementarity for second strand priming.

A second primer is then added, which generally consists of some or all of the complementarity region to prime the first primer and optional sequences necessary for a second unique restriction site for purposes of subcloning. Extension with DNA polymerase results in double stranded oligonucleotides, which are then cleaved with appropriate restriction endonucleases and subcloned into the target retroviral vectors.

Any number of suitable retroviral vectors may be used. In one aspect, preferred vectors include those based on murine stem cell virus (MSCV) (Hawley, et al. (1994) Gene Therapy 1: 136), a modified MFG virus (Riviere et al. (1995) Genetics 92: 6733), pBABE, and others described above. Well suited retroviral transfection systems are described in Mann et al., supra; Pear et al. (1993) Proc. Natl. Acad. Sci. USA 90: 8392–96; Kitamura, et al. Human Gene Ther. 7: 1405–1413; Hofmann, et al. Proc. Natl. Acad. Sci. USA 93: 5185–90; Choate et al. (1996) Human Gene Ther. 7: 2247; WO 94/19478; PCT US97/01019, and references cited therein, all of which are incorporated by reference.

The vectors used to introduce candidate agents may include inducible and constitutive promoters for the expression of the candidate agents, as described above. For example, there are situations wherein it is necessary to induce peptide expression only during certain phases of the selection process, such as during particular periods of the cell cycle. As described above, a large number of constitutive and inducible promoters are well known, and may be used to regulate expression of the candidate agents.

In a preferred embodiment, the bioactive candidate agents comprising nucleic acids and proteins are linked to a fusion partner, as described above. In one aspect, combinations of fusion partners are used. Any number of combinations of presentation structures, targeting sequences, rescue sequences, and stability sequences may be used with or without linker sequences. Thus, candidate agents, which include these components, may be used to generate a library of fragments, each containing a different candidate nucleotide sequence (e.g., random nucleic acid, cDNA, genomic DNA etc.) that may encode a different peptide sequence.

In a preferred embodiment, when the candidate agent is introduced to the cells using expression vectors, the candidate peptide agent is linked to a detectable molecule, and the methods of the invention include at least one expression assay. Thus, the detectable molecule may comprise reporter and selection genes as described herein. In one preferred embodiment, the detectable molecule is distinguishable from that expressed by the fusion nucleic acid expressing a gene of interest. An expression assay is an assay that allows the determination of whether a candidate bioactive agent has been expressed, i.e., whether a candidate peptide agent is present in the cell. Thus, by linking the expression of a candidate agent to the expression of a detectable molecule such as a label, the presence or absence of the candidate peptide agent may be determined. Accordingly, in this embodiment, the candidate agent is operably linked to a detectable molecule. Generally, this is done by creating a fusion nucleic acid. The fusion nucleic acid comprises a first nucleic acid expressing the candidate bioactive agent (which can include fusion partners, as outlined above), and a second nucleic acid expressing a detectable molecule. In a preferred embodiment, the fusion nucleic acid encodes a fusion polypeptide comprising the candidate agent and the detectable molecule. In another preferred embodiment, the fusion nucleic acid may use one promoter for the first nucleic acid and a second promoter for the second nucleic acid to produce separate nucleic acids comprising a candidate nucleic acid, which may or may not encode a protein, and the detectable molecule. In yet another preferred embodiment, the fusion nucleic acid may use separation sequences described herein to express separate candidate bioactive agent and detectable molecule. The terms “first” and “second” are not meant to confer an orientation of the sequences with respect to 5′–3′ orientation of the fusion nucleic acid. For example, assuming a 5′–3′ orientation of the fusion sequence, the first nucleic acid may be located either 5′ to the second nucleic acid, or 3′ to the second nucleic acid. Preferred detectable molecules in this embodiment include, but are not limited to, various fluorescent proteins and their variants, including A. victoria GFP, Renilla muelleri GFP, Renilla reniformis GFP, Ptilosarcus gurneyi GFP, YFP, BFP, RFP, Anemonia majano fluorescent proteins, Zoanthus fluorescent proteins, Discosoma striata fluorescent proteins, and Clavularia fluorescent proteins.

Thus, in one preferred embodiment, the vectors used to introduce candidate agents comprise a promoter operably linked to fusion nucleic acids encoding fusion polypeptides comprising rGFP or pGFP, including fusions with random nucleic acids (i.e., for expressing random peptides), cDNAs, and genomic DNA fragments. Fusions to rGFP or pGFP provide a way of monitoring expression of the candidate agent, tracking and localization of the candidate agent, and sorting cells expressing the candidate agents. In another aspect, a preferred embodiment comprises a vector comprising a promoter, a first gene of interest, a separation sequence, and a second gene of interest comprising rGFP or pGFP. The first gene of interest expresses the candidate agent while the GFP reporter allows monitoring of its expression. Expressing the candidate agent and the reporter separately reduces any interference with the activity of the candidate agent that could result from fusing it to a reporter protein. If the candidate agent comprises an rGFP or pGFP fusion protein, the second gene of interest may comprise a reporter distinguishable from the rGFP or pGFP fusion protein.

In general, the candidate agents are added to the cells, either extracellularly or intracellularly, as outlined above, under reaction conditions that favor agent-target interactions. Generally, this will be physiological conditions. Incubations may be performed at any temperature which facilitates optimal activity, typically between 4 and 40° C. Incubation periods are selected for optimum activity, but may also be optimized to facilitate rapid high throughput screening. Typically between 0.1 and 24 hours will be sufficient. Excess reagent is generally removed or washed away.

A variety of other reagents may be included in the assays. These include reagents like salts, neutral proteins, e.g., albumin, detergents, synthetic polymers (polyethylene glycol, dextran sulfate), ionic agents, etc., which may be used to facilitate optimal protein-protein binding and/or reduce non-specific or background interactions. Also reagents that otherwise improve the efficiency of the assay, such as protease inhibitors, nuclease inhibitors, anti-microbial agents, etc., may be used. The mixture of components may be added in any order that provides for detection. Washing or rinsing the cells will be done as will be appreciated by those in the art at different times, and may include the use of filtration and centrifugation. When second labeling moieties (also referred to herein as “secondary labels”) are used, they are preferably added after excess non-bound target molecules are removed, in order to reduce non-specific binding. However, under some circumstances, all the components may be added simultaneously.

As will be appreciated by those in the art, the type of cells used in the present invention can vary widely. Basically, the screen may use any cell in which the fusion nucleic acids of the present invention can be introduced and expressed. These include bacterial, fungal, plant, insect, and mammalian cells. In a preferred embodiment, when the cells are mammalian cells, particularly preferred cells are mouse, rat, primate and human cells. When the candidate agents are in the form of retroviral vectors, the screen may use any mammalian cells in which a library of retroviral vectors comprising the fusion nucleic acids of the present invention is made. In addition, modification of the retroviral system by pseudotyping allows nearly all mammalian cell types to be used (see Morgan, R. A. et al. (1993) J. Virol. 67: 4712–21; Yang, Y. et al. (1995) Hum. Gene Ther. 6: 1203–13).

As is more fully described below, a screen is set up such that the cells exhibit a selectable phenotype in the presence of a candidate agent. For mammalian cells, cell types implicated in a wide variety of disease conditions are particularly useful, so long as a suitable screen may be designed to allow the selection of cells that exhibit an altered phenotype as a consequence of the presence of a candidate bioactive agent within the cell. Accordingly, suitable cell types include, but are not limited to, tumor cells of all types (particularly melanoma, myeloid leukemia, carcinomas of the lung, breast, ovaries, colon, kidney, prostate, pancreas, and testes), cardiomyocytes, endothelial cells, epithelial cells, lymphocytes (T-cells and B-cells), mast cells, eosinophils, vascular intimal cells, hepatocytes, leukocytes including mononuclear leukocytes, stem cells such as hemopoietic, neural, skin, lung, kidney, liver and myocyte stem cells (for use in screening for differentiation and de-differentiation factors), osteoclasts, chondrocytes and other connective tissue cells, keratinocytes, melanocytes, liver cells, kidney cells, and adipocytes. Suitable cells also include known research cells, including, but not limited to, Jurkat T cells, NIH3T3 cells, CHO, Cos, etc. See the ATCC cell line catalog, hereby expressly incorporated by reference.

In one embodiment, the cells may be genetically engineered, that is, contain exogenous nucleic acids, for example to contain target molecules.

In a preferred embodiment, a first plurality of cells is screened. That is, the cells into which the candidate nucleic acids are introduced are screened for an altered phenotype. Thus, in this embodiment, the effect of the bioactive candidate agent is seen in the same cells in which it is made; i.e., an autocrine effect.

By a “plurality of cells” herein is meant roughly from about 10³ cells to 10⁸ or 10⁹, with from 10⁶ to 10⁸ being preferred. This plurality of cells comprises a cellular library, wherein generally each cell within the library contains a member of the retroviral molecular library, e.g., a different candidate nucleic acid, although as will be appreciated by those in the art, some cells within the library may not contain a retrovirus, and some may contain more than one. When methods other than retroviral infection are used to introduce the candidate nucleic acids into a plurality of cells, the distribution of candidate nucleic acids within the individual cell members of the cellular library may vary widely, as it is generally difficult to control the number of nucleic acids which enter a cell during electroporation, transfection, etc.
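The observation that some cells receive no retrovirus and some receive more than one can be made concrete with a simple statistical model. The sketch below assumes that the number of infection events per cell follows a Poisson distribution with a mean equal to the multiplicity of infection; this is a common modeling assumption and is not stated in the specification. The function names and the example multiplicities are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson variable with this mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def infection_classes(moi):
    """Fractions of cells with zero, one, or more than one library member,
    under the assumed Poisson model of infection."""
    none = poisson_pmf(0, moi)
    single = poisson_pmf(1, moi)
    return {"none": none, "single": single, "multiple": 1.0 - none - single}


if __name__ == "__main__":
    for moi in (0.1, 0.5, 1.0):
        fractions = infection_classes(moi)
        print(moi, {name: round(value, 4) for name, value in fractions.items()})
```

Under this model a low multiplicity keeps the fraction of multiply infected cells small at the cost of leaving most cells without a library member, which is one way to reason about the trade-off when sizing the cellular library.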

In a preferred embodiment, the candidate nucleic acids are introduced into a first plurality of cells, and the effect of the candidate bioactive agents is screened in a second or third plurality of cells, different from the first plurality of cells, i.e., generally a different cell type. That is, the effect of the bioactive agents is due to an extracellular effect on a second cell, i.e., an endocrine or paracrine effect. This is done using standard techniques. The first plurality of cells may be grown in or on one media, and the media is allowed to touch a second plurality of cells, and the effect measured. Alternatively, there may be direct contact between the cells. Thus, “contacting” as used herein is a functional contact, and includes both direct and indirect contact. In this embodiment, the first plurality of cells may or may not be screened.

If necessary, the cells are subjected to conditions suitable for the expression of the candidate nucleic acids, for example when inducible promoters are used, to produce the candidate expression products, either translation or transcription products. Expression of the candidate agents results in functional contact of the candidate agent and the cell. Thus, in a preferred embodiment, the methods of the present invention comprise introducing candidate nucleic acids into a plurality of cells, a cellular library. The plurality of cells is then screened, as is more fully outlined below, for a cell exhibiting an altered phenotype. The altered phenotype is due to the presence of a candidate bioactive agent.

By “altered phenotype” or “changed physiology” or other grammatical equivalents herein is meant that the phenotype of the cell is altered in some way, preferably in some detectable and/or measurable way. As will be appreciated in the art, a strength of the present invention is the wide variety of cell types and potential phenotypic changes which may be tested using the present methods. Accordingly, any phenotypic change which may be observed, detected, or measured may be the basis of the screening methods herein. Suitable phenotypic changes include, but are not limited to, gross physical changes such as changes in cell morphology, cell growth, cell viability, adhesion to substrates or other cells, and cellular density; changes in the expression of one or more RNAs, proteins, lipids, hormones, cytokines, or other molecules; changes in the equilibrium state (i.e., half-life) of one or more RNAs, proteins, lipids, hormones, cytokines, or other molecules; changes in the localization of one or more RNAs, proteins, lipids, hormones, cytokines, or other molecules; changes in the bioactivity or specific activity of one or more RNAs, proteins, lipids, hormones, cytokines, receptors, or other molecules; changes in the secretion of ions, cytokines, hormones, growth factors, or other molecules; alterations in cellular membrane potentials, polarization, integrity or transport; changes in infectivity, susceptibility, latency, adhesion, and uptake of viruses and bacterial pathogens; etc. By “capable of altering the phenotype” herein is meant that the candidate agent can change the phenotype of the cell in some detectable and/or measurable way.

The altered phenotype may be detected in a wide variety of ways, as is described more fully below, and detection will generally depend on and correspond to the phenotype that is being changed. Generally, the changed phenotype is detected using, for example, microscopic analysis of cell morphology; standard cell viability assays, including both increased cell death and increased cell viability, for example, cells that are now resistant to cell death via virus, bacteria, or bacterial or synthetic toxins; standard labeling assays such as fluorometric indicator assays for the presence or level of a particular cell or molecule, including FACS or other dye staining techniques; biochemical detection of the expression of target compounds after killing the cells; etc. In some cases, as is more fully described herein, the altered phenotype is detected in the cell in which the randomized nucleic acid was introduced; in other embodiments, the altered phenotype is detected in a second cell which is responding to some molecular signal from the first cell.

In a preferred embodiment, once a cell with an altered phenotype is detected, the cell is isolated from the plurality which do not have altered phenotypes. Isolation of the altered cell may be done in any number of ways, as is known in the art, and will in some instances depend on the assay or screen. Suitable isolation techniques include, but are not limited to, FACS; lysis selection using complement; cell cloning; scanning by Fluorimager; expression of a “survival” protein; induced expression of a cell surface protein or other molecule that can be rendered fluorescent or taggable for physical isolation; expression of an enzyme that changes a non-fluorescent molecule to a fluorescent one; overgrowth against a background of no or slow growth; death of cells and isolation of DNA or other cell vitality indicator dyes; etc.

In a preferred embodiment, the candidate nucleic acid and/or bioactive agent is isolated from the positive cell. In one aspect, primers complementary to DNA regions common to the expression constructs, or to specific components of the library such as a rescue sequence, defined above, are used to “rescue” the unique random sequence. Alternatively, the bioactive candidate agent is isolated using a rescue sequence. For example, rescue sequences comprising epitope tags or purification sequences may be used to pull out the bioactive candidate agent using immunoprecipitation or affinity columns. In some instances, as is outlined below, this may also pull out the primary target molecule if there is a sufficiently strong binding interaction between the bioactive agent and the target molecule. Alternatively, the peptide may be detected using mass spectroscopy.

Once rescued, the sequence of the candidate agent and/or bioactive nucleic acid is determined. This information can then be used in a number of ways.

In a preferred embodiment, the candidate agent is resynthesized and reintroduced into the target cells to verify the effect. For mammalian cells, this may be done using retroviruses, or alternatively using fusions to the HIV-1 Tat protein, and analogs and related proteins, which allow very high uptake into target cells (see for example, Fawell, S. et al. (1994) Proc. Natl. Acad. Sci. USA 91: 664–68; Frankel, A. D. et al. (1988) Cell 55: 1189–93; Savion, N. et al. (1981) J. Biol. Chem. 256: 1149–54; Derossi, D. et al. (1994) J. Biol. Chem. 269: 10444–50; and Baldin, V. et al. (1990) EMBO J. 9: 1511–17, all of which are incorporated by reference).

In a preferred embodiment, the sequence of a candidate agent is used to generate more candidate bioactive agents. For example, the sequence of the candidate agent may be the basis of a second round of (e.g., biased) randomization, to develop other candidate agents with increased or altered activities. Alternatively, the second round of randomization may change the affinity of the candidate agent. Furthermore, it may be desirable to put the identified random region of the candidate agent into other presentation structures, or to alter the sequence of the constant region of the presentation structure, to alter the conformation/shape of the candidate agent. It may also be desirable to “walk” around a potential binding site, in a manner similar to the mutagenesis of a binding pocket, by keeping one end of the ligand region constant and randomizing the other end to shift the binding of the peptide around.

In a preferred embodiment, either the candidate agent or the candidate nucleic acid encoding it is used to identify target molecules. As will be appreciated by those in the art, there may be primary target molecules, to which the candidate agent binds or upon which it acts directly, and there may be secondary target molecules, which are part of the signaling pathway affected by the bioactive agent; these might be termed “validated targets”.

In a preferred embodiment, the bioactive agent is used to pull out target molecules. For example, as outlined herein, if the target molecules are proteins, the use of epitope tags or purification sequences can allow the purification of primary target molecules via biochemical means (e.g., co-immunoprecipitation, affinity columns, etc.). Alternatively, the peptide, when expressed in bacteria and purified, can be used as a probe against a bacterial cDNA expression library made from mRNA of the target cell type. Alternatively, peptides can be used as “bait” in either yeast or mammalian two- or three-hybrid systems. Such interaction cloning approaches have been very useful in isolating DNA-binding proteins and other interacting protein components. The peptide(s) can be combined with other pharmacologic activators to study the epistatic relationships of the signal transduction pathways in question. It is also possible to synthetically prepare a labeled peptide candidate agent and use it to screen a cDNA library expressed in bacteriophage for those expressed cDNAs which bind the peptide. Furthermore, it is also possible to use cDNA cloning via retroviral libraries to “complement” the effect induced by the peptide. In such a strategy, the peptide would be required to be stoichiometrically titrating away some important factor for a specific signaling pathway. If this molecule or activity is replenished by over-expression of a cDNA from within a cDNA library, then one can clone the target. Similarly, cDNAs cloned by any of the above yeast or bacteriophage systems can be reintroduced into mammalian cells in this manner to confirm that they act to complement function in the system the peptide acts upon.

Once primary target molecules have been identified, secondary target molecules may be identified in the same manner, using the primary target as the “bait”. In this manner, signaling pathways may be elucidated. Similarly, bioactive agents specific for secondary target molecules may also be discovered, to identify a number of bioactive agents acting on a single pathway, for example when developing combination therapies.

The methods of the present invention may be useful for screening a large number of cell types under a wide variety of conditions. Generally, the host cells are cells that are involved in disease states, and they are tested or screened under conditions that normally result in undesirable consequences on the cells. When a suitable bioactive candidate agent is found, the undesirable effect may be reduced or eliminated. Alternatively, normally desirable consequences may be reduced or eliminated, with an eye towards elucidating the cellular mechanisms associated with the disease state or signaling pathway.

In view of all the foregoing, the compositions and methods described herein are useful in a variety of applications. In one preferred embodiment, the compositions of the present invention are useful as reporters for gene expression. In these applications, the compositions may be operably linked to the promoter elements to provide a measure of gene expression. When used with separation sequences as a downstream gene of interest, the rGFP or pGFP provides a basis for monitoring levels of expression of the upstream gene of interest.

In another preferred embodiment, the compositions of the present invention are useful for tracking and localizing proteins. In these embodiments, proteins or peptides are fused to rGFP or pGFP, which serve as reporters for monitoring localization of proteins to subcellular compartments; assessing intracellular trafficking of proteins; or examining protein-protein interactions, protein-nucleic acid interactions, and protein interactions with other molecules.

Since protein-interaction domains serve as a basis for many cellular processes and cell signaling events, preferred embodiments of the present invention further comprise substrates for enzymatic reactions, such as those of proteases, kinases and phosphatases, and further serve as intracellular biosensors that provide information about the physiological state of the cell.

In other preferred embodiments, the compositions of the present invention are useful as candidate agents in the form of random nucleic acids, cDNAs, cDNA fragments or genomic DNA fragments fused to a rGFP or pGFP gene. These GFP fusions provide a basis for monitoring expression and localization of the candidate agent, and importantly serve as a scaffold for constraining the peptide for presentation in a biologically active form. In addition, the GFP moiety is useful as a rescue sequence and for pulling out cellular targets of the candidate agents.

In these embodiments, the methods outlined herein are used to screen for modulators of cellular phenotypes. Cellular phenotypes that may be assayed include, but are not limited to, cell apoptosis, cell cycle, exocytosis, cytokine secretion, cell adhesion, signal transduction, protein interaction, etc. As will be appreciated by those in the art, any number of cellular assays that rely on rGFP or pGFP and their variants can be developed.

In one preferred embodiment, the rGFP or pGFP can be used to evaluate, test and screen promoters. Thus, in this embodiment, the present invention provides compositions comprising a promoter of interest and a gene encoding a rGFP or pGFP. Alternatively, the compositions comprise a promoter operably linked to a gene of interest, a separation sequence, and a gene encoding rGFP or pGFP. Preferably, the promoter is not the native rGFP or pGFP promoter.

In a preferred embodiment, the fusion nucleic acids are used to screen for modulators of promoter activity. By “modulation” of promoter activity herein is meant an increase or decrease in transcription of the fusion nucleic acid regulated by the promoter of interest. Various promoters of different organisms are amenable to analysis, including promoters of bacterial, yeast, worm, insect, plant, and mammalian cells. In mammalian cells, examples of relevant promoters are the IL-4 inducible ε promoter, IgH promoter, NF-κB regulated promoters, APC/β-catenin regulated promoters, myc regulated promoters, cell specific promoters (peripheral nervous system, central nervous system, kidney, skin, bone, lung, heart, liver, bladder, ovary, testes, colon, etc.), cytokine regulated promoters, stress regulated promoters (e.g., heat shock), circadian rhythm regulated promoters, and promoters regulating HIV viral gene expression and cell cycle genes. Preferred are promoters that regulate expression of signal transduction proteins, cell cycle regulatory proteins, or oncogenes, or promoters which are themselves regulated by signal transduction pathways, cell cycle regulators, or other aspects of cell regulatory networks.

Candidate agents are contacted with the cells comprising the fusion nucleic acid and examined for effects on reporter gene expression (see for example, WO 99/58663, hereby expressly incorporated by reference). If the promoter is inducible, the promoter is induced with an appropriate stimulus or effector. Alternatively, the promoter is induced prior to addition of the candidate bioactive agents, or simultaneously. For example, for the IL-4 inducible ε promoter, addition of cytokine IL-4 or IL-13 to the cells (e.g., IL-4 of not less than 5 units/ml and at a preferred concentration of 200 units/ml) can induce transcription of the ε promoter. Screening of candidate agents affecting inducible expression of the reporter will allow identification of cellular targets involved in signal transduction events mediated by the cytokine.

To provide a more stringent selection for promoter regulators, the fusion nucleic acid may comprise a promoter, a rGFP or pGFP, a separation sequence, and a reporter/selection gene distinguishable from rGFP or pGFP. The GFP allows selection of cells expressing the fluorescent protein while the reporter/selection gene allows an additional basis for selecting cells. In one aspect, the reporter/selection gene may be a death gene that provides a nucleic acid that encodes a protein causing cell death. It is preferable that cell death require a two step process: expression of the death gene and induction of the death phenotype by a signal or ligand. This two step process is desirable when the promoter being analyzed is constitutively active. For example, if the selection gene is a thymidine kinase (TK), the cells can be selected based on killing by ganciclovir, since TK activity is needed for ganciclovir toxicity. Alternatively, the selection gene may encode the heparin binding epidermal growth factor (HBEGF) protein, with the killing initiated by adding diphtheria toxin. Thus, candidate agents that repress promoter activity are readily identified by selecting for cells that are resistant to cell death and lacking in GFP expression. The presence of a separation sequence, such as Type 2A, allows expression of both reporter and selection genes from a single transcript, thus providing a sensitive indicator of promoter activity. Verification of the presence of the death gene is preferred to keep the levels of false positives low; that is, survival of cells in the screen should be due to the presence of an inhibitor of the promoter rather than a lack of the death gene.
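
The selection logic of the two step death gene screen can be sketched as a small truth table. The following Python fragment is an illustration only, not part of the disclosed methods; the class and function names are hypothetical, and each cell is reduced to two idealized yes/no properties (whether the reporter construct is present, and whether a candidate agent has repressed the promoter).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Idealized cell in the screen (hypothetical model, for illustration)."""
    has_construct: bool       # promoter-GFP-separation sequence-death gene present
    promoter_repressed: bool  # a candidate agent has shut the promoter off


def expresses_reporters(cell: Cell) -> bool:
    # GFP and the death gene come from one transcript, so they are
    # expressed together or not at all.
    return cell.has_construct and not cell.promoter_repressed


def survives_selection(cell: Cell) -> bool:
    # Two step killing: the death gene must be expressed AND the inducing
    # signal (e.g., the drug or toxin) is assumed to have been added.
    return not expresses_reporters(cell)


def is_true_hit(cell: Cell) -> bool:
    # A genuine promoter repressor: survives, is GFP negative, and is
    # verified to still carry the death gene construct.
    gfp_negative = not expresses_reporters(cell)
    return survives_selection(cell) and gfp_negative and cell.has_construct


if __name__ == "__main__":
    for cell in (Cell(True, True), Cell(True, False), Cell(False, False)):
        print(cell, "survives:", survives_selection(cell),
              "true hit:", is_true_hit(cell))
```

In this model a cell that has lost the construct survives and is GFP negative, exactly like a cell carrying a repressed promoter, which is why the verification step is needed to separate the two.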

In another preferred embodiment, an inducible promoter may be linked to “one step” death genes (e.g., diphtheria toxin A fragment). In this embodiment, the inducible promoter is leaky such that some small amount of the death gene product and the reporter protein (e.g., rGFP or pGFP) is expressed. The low level of reporter gene expression allows selection of cells containing the death gene to avoid false positives. These cells are contacted with candidate agents, and the promoter is induced to express the death gene. Selection of surviving cells enriches for those cells that contain agents inhibiting the promoter.

For examining promoters regulated by specific signal transduction pathways, cells capable of transducing the signal are used. For example, for the IL-4 inducible ε promoter system, any cells that express an IL-4 receptor that transduces the IL-4 signal to the nucleus and alters transcription can be used. Suitable cells include, but are not limited to, human cells and cell lines that show IL-4/13 inducible production of germline ε transcripts, including, but not limited to, DND39 (see Watanabe, supra), MC-116 (Kumar, et al. (1990) Eur. Cytokine Netw. 1: 109), and CA-46 (Wang, et al. (1996) J. Natl. Cancer Inst. 88: 956). As is noted herein, the ability of MC-116 and CA-46 cells to produce germline ε transcripts upon IL-4/13 induction was not known prior to the present invention. Thus, preferred embodiments provide for MC-116 and/or CA-46 cells comprising recombinant nucleic acid reporter constructs as outlined herein.

In another preferred embodiment, the fusion construct comprises an endogenous promoter and an exogenous rGFP or pGFP gene. By “endogenous” in this context is meant present within the host cell. In this regard, an exogenous rGFP, pGFP, or variant thereof is incorporated into the genome such that the reporter gene is under the control of the endogenous promoter. These constructions are desirable for examining and modulating the full range of endogenous regulation, particularly promoter control elements (e.g., enhancers, inhibitory elements, etc.) other than the promoter fragment.

Generating the endogenous-exogenous fusion construct may proceed in any number of ways depending on the organism used. In one preferred embodiment, homologous recombination mechanisms present in different organisms provide the basis for inserting the exogenous reporter gene to form the fusion construct. That is, gene “knock-in” constructions are made, whereby an exogenous rGFP or pGFP gene as outlined herein is added, via homologous recombination, to the genome, such that the reporter gene is under the control of the endogenous promoter. Homologous recombination methods are well known in the art (see Westphal, et al. (1997) Current Biology 7: R530–R533 and references cited therein; Rothstein, R. (1991) Methods Enzymol. 194: 281–301; Kaur, R. (1997) Nucleic Acids Res. 25: 1080–81; and Miller, J. H., In A Short Course in Bacterial Genetics: A Laboratory Manual and Handbook for Escherichia coli and Related Bacteria, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 1992). These homologous recombination driven methods may use recA or recA type proteins to enhance the recombination process (see PCT US93/03868, hereby incorporated by reference).

In another preferred embodiment, the selection of the “knock ins” is done by FACS on the basis of incorporation of the rGFP or pGFP gene. Thus, in one aspect, a first homologous recombination event places a rGFP or pGFP gene into at least one allele of the cell genome. When the promoter is the IL-4 inducible promoter, a cell type that exhibits IL-4 inducible production of at least germline ε transcripts is preferred so that the cells may be tested by IL-4 inducible reporter gene expression. That is, transformed cells are selected by FACS for reporter gene expression upon treatment with IL-4. Suitable cells include, but are not limited to, human cells and cell lines that show IL-4/13 inducible production of germline ε transcripts. Preferably, once a first endogenous promoter has been combined with an exogenous reporter construct, a second homologous recombination event may be done, preferably using a second reporter gene different from the first, to target the other allele of the cell genome, and tested as above. Generally, IL-4 induction of the rGFP or pGFP genes will indicate the correct placement of the genes, which can be confirmed via sequencing such as PCR sequencing, or by Southern blot hybridization. In addition, preferred embodiments utilize pre-screening steps to remove “leaky” cells, i.e., those showing constitutive expression of the rGFP or pGFP gene.
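
The pre-screen and the induction-based selection described above amount to a two-gate filter. The following Python fragment is a minimal sketch under stated assumptions: each clone is summarized by two hypothetical GFP fluorescence readings (without and with IL-4), and a single hypothetical threshold, here called `gate`, separates GFP positive from GFP negative cells. None of these names or values come from the specification.

```python
from typing import Dict, List, Tuple


def select_inducible_knock_ins(
    clones: Dict[str, Tuple[float, float]],
    gate: float,
) -> List[str]:
    """Return clone ids that are GFP negative uninduced and GFP positive induced.

    clones maps a clone id to (uninduced_signal, induced_signal), both in
    arbitrary fluorescence units (hypothetical data layout).
    """
    selected = []
    for clone_id, (uninduced, induced) in clones.items():
        if uninduced >= gate:
            # "Leaky" clone: constitutive reporter expression, removed
            # in the pre-screening step.
            continue
        if induced >= gate:
            # Reporter comes on only after induction, consistent with
            # placement under the inducible endogenous promoter.
            selected.append(clone_id)
    return selected


if __name__ == "__main__":
    example = {"a": (5.0, 400.0), "b": (300.0, 450.0), "c": (4.0, 6.0)}
    print(select_inducible_knock_ins(example, gate=100.0))
```

Clones passing both gates are only candidates; as the text notes, correct placement would still be confirmed by sequencing or Southern blot hybridization.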

In another preferred embodiment, endogenous-exogenous fusion constructs are made via site specific recombination. In these embodiments, the site specific recombination sequence, such as loxP, is inserted into the desired site(s), preferably by homologous recombination, although random insertions are possible with other vectors depending on the cell type being used (e.g., phage Mu, retroviral vectors). Following generation of cells containing the site specific sites, a vector comprising the rGFP or pGFP and an appropriately placed loxP site is introduced into the cell. Expressing the cre recombinase allows recombination between the loxP sites on the two separate nucleic acids, thus resulting in insertion of the vector into the chromosomally located loxP site.

As above, these cells are induced with the appropriate inducer if the endogenous promoter of interest is inducible and then contacted with candidate agents. When the cells comprise fusion nucleic acids expressing candidate agents comprising rGFP or pGFP fusion proteins, or candidate agents expressed from a fusion nucleic acid comprising a first gene of interest, a separation sequence, and a second gene of interest comprising a rGFP or pGFP, a reporter gene distinguishable from the rGFP or pGFP proteins is used to monitor promoter modulation. This strategy allows simultaneous monitoring of the expression of the candidate agent and the promoter.

In another preferred embodiment, the fusion nucleic acids comprising rGFP or pGFP and a weak promoter, or no promoter, are inserted into a host chromosome to scan for promoter elements on the host chromosome. In a preferred embodiment, this may be done conveniently by using a viral backbone for constructing the fusion nucleic acids. For example, in bacteria, the phage Mu systems allow random insertions into the host chromosome, while in mammalian cells, retroviral vectors provide a suitable vehicle for inserting the fusion nucleic acids into the host chromosome. When retroviral vectors are used, SIN type vectors lacking viral promoters are preferred so that the reporter gene is transcribed or activated from endogenous promoters or promoter regulatory elements upon insertion of the viral DNA into the host chromosome. Expression of rGFP or pGFP indicates insertion near an endogenous promoter. Identifying cells expressing the reporter gene upon treatment with inducers allows identification of promoters regulated by the inducing agent. Cells comprising these insertions are contacted with candidate agents, for example, by expressing candidate nucleic acids or proteins in the cells. Those agents modulating promoter activity are identified based on expression of the rGFP or pGFP reporter.

In the endogenous-exogenous fusion constructs described above, the exogenous fusion nucleic acid used to monitor promoter activity may comprise a rGFP or pGFP, or a fusion nucleic acid comprising a first gene of interest, a separation sequence, and a second gene of interest comprising a rGFP or pGFP. The latter construct allows identifying cells based on expression of two reporter/selection genes if the first gene of interest encodes a reporter gene distinguishable from rGFP or pGFP.

In addition, in a preferred embodiment, the fusion nucleic acids of the present invention may also contain site specific recombination sites for deleting or rearranging the fusion nucleic acids when introduced into a cell. As described above, these sequences may comprise loxP or flp sites flanking the nucleic acid segment to be rearranged. As is well known in the art, the sites are placed in an appropriate orientation so that either deletion or rearrangement (i.e., inversion) will occur upon contact of the sequences with a site specific recombinase. In a preferred embodiment, the site specific sequences flank the rGFP or pGFP gene or flank the fusion nucleic acid comprising a first gene of interest, a separation sequence, and a second gene of interest comprising rGFP or pGFP. Thus, deletion or rearrangement results in removal or rearrangement that prevents operable linkage of the promoter to the fusion nucleic acid to be expressed. In another preferred embodiment, the site specific sequences are orientated such that rearrangement results in operable linkage of the promoter on the expression vector or the endogenous promoter when rearrangement is induced by the recombinase. This may be desirable when examining promoters active at specific stages in cell development or in examining cell lineage.

In a preferred embodiment, the fusion nucleic acids of the present invention are used to identify candidate agents that alter a cellular phenotype. In these embodiments, the fusion nucleic acids of the present invention provide a way, among others, for detecting or monitoring a cellular phenotype, inducing a phenotype being examined, and measuring synthesis of a gene of interest, such as candidate agents to be screened.

Accordingly, in one preferred embodiment, the fusion nucleic acids of the present invention find use in screens for cells with altered exocytosis. By “alteration” or “modulation” in relation to exocytosis is meant a decrease or increase in the amount or frequency of exocytosis in one cell compared to another cell, or in the same cell under different conditions. Often mediated by specialized cells, exocytosis is vital for a variety of cellular processes, including neurotransmitter release by neurons, hormone release by adrenal chromaffin cells (e.g., adrenaline) and pancreatic β-cells (e.g., insulin), and histamine release by mast cells.

Disorders involving exocytosis are numerous. For example, the inflammatory immune response mediated by mast cells leads to a variety of disorders, including asthma and allergies. Therapy for allergy remains limited to blocking mediators released by mast cells (e.g., antihistamines) and non-specific anti-inflammatory agents, such as steroids and mast cell stabilizers. These treatments are only marginally effective in alleviating the symptoms of allergy. To identify cellular targets for drug design or candidate effectors of exocytosis, the fusion nucleic acids expressing GFP fusion proteins (e.g., fused to random peptides) or expressing a gene of interest comprising candidate agents may be introduced into appropriate cells, for example mast cells, and selected for modulation of exocytosis by assaying for changes in cellular exocytosis properties under various conditions. For example, the cells may be examined in the presence or absence of physiological signals, such as Ca²⁺, ionophores, hormones, antibodies, peptides, drugs, antigens, cytokines, growth factors, membrane potentials, cell-cell contacts, and the like. In other aspects, the measurements are taken under the same conditions for different cells. These cells are stimulated with an appropriate inducer if exocytosis is triggered by an inducing signal. Alternatively, cells with a conditional mutation for exocytosis events are used in screens for candidate agents affecting exocytosis regulators.

In one preferred embodiment, the cells used for screening may be engineered to be defective in exocytosis. For example, cells may be transformed with a fusion nucleic acid expressing a conditional gene product whose expression under restrictive conditions produces an exocytosis defect. Alternatively, the fusion nucleic acid may express a dominant effect protein affecting exocytosis. Examples of these types of genes of interest are dynamin and Ese1, proteins involved in endocytosis but which indirectly affect exocytosis. Expression of temperature sensitive conditional mutants of dynamin or Ese1 in cells can induce endocytosis and exocytosis defects (Damke, H. et al. (1995) J. Cell Biol. 131: 69–80; Damke, H. et al. (1994) J. Cell Biol. 127: 915–934; Sengar, A. S. (1999) EMBO J. 18: 1159–1171). Thus, in a preferred embodiment, the cell may comprise a fusion nucleic acid containing a conditional dynamin gene, a separation sequence, and a reporter gene comprising rGFP or pGFP. Expression of the dynamin gene under the restrictive condition disrupts endocytosis, thus resulting in a deficiency in exocytosis. Candidate agents are screened under the restrictive condition for activation of exocytosis. When candidate agents comprise GFP fusion proteins (e.g., random peptide or cDNA GFP fusions), or are expressed as a first gene of interest, a separation sequence, and a second gene of interest comprising rGFP or pGFP, the reporter gene chosen is distinguishable from the expressed GFP.

Assays for changes in exocytosis may comprise sorting cells in a fluorescence cell sorter (FACS) by measuring alterations of various exocytosis indicators, such as light scattering, fluorescent dye uptake, fluorescent dye release, granule release; targeting and quantity of granule specific proteins (see for example, WO 99/54494); and capacitance measurements. Use of combinations of indicators reduces background and increases specificity of the sorting assay.

Exocytosis assays may be based on changes in light scattering properties, including the forward and side scatter properties of the cells, which are indicative of the size, shape, and granule content of the cell. Multiparameter FACS selections based on light scattering properties of cells are well known in the art (see Paretti, M. et al. (1990) J. Pharmacol. Methods 23: 187–94; Hide, I. et al. (1993) J. Cell Biol. 123: 585–93).

Assays based on uptake of fluorescent dyes reflect the coupling of exocytosis and endocytosis. In these assays, the endocytosis levels indirectly reflect exocytosis levels since the cell attempts to maintain cell volume and membrane integrity as the amount of cell membrane rapidly changes when secretory vesicles fuse with the cell membrane. Preferred fluorescent dyes include styryl dyes, such as FM1-43, FM4-64, FM14-68, FM2-10, FM4-84, FM1-84, FM14-27, FM14-29, FM3-25, FM3-14, FM5-55, RH414, FM6-55, FM10-75, FM1-81, FM9-49, FM4-95, FM4-59, FM9-40, and combinations thereof. Styryl dyes such as FM1-43 are only weakly fluorescent in water but highly fluorescent when associated with a membrane, such that dye uptake by endocytosis is readily discernible (Betz, et al. (1996) Current Opinion in Neurobiology 6: 365–371; Molecular Probes, Inc., Eugene, Oreg., “Handbook of Fluorescent Probes and Research Chemicals”, 6th Edition, 1996, particularly, Chapter 17, and more particularly, Section 2 of Chapter 17, hereby incorporated herein by reference). Useful solution dye concentration is about 25 to 1000–5000 nM, with from about 50 to about 1000 nM being preferred, and from about 50 to about 250 nM being particularly preferred.
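
The nested concentration ranges stated above can be expressed as a simple lookup. The following Python fragment is a sketch only; the function name is hypothetical, the outermost upper bound is assumed to be 5000 nM (the text gives it as 1000–5000 nM), and the innermost range is assumed to be in nM, since the text gives no unit for it. The word "about" in the text is ignored, so the boundaries here are sharper than the text intends.

```python
def dye_concentration_tier(conc_nm: float) -> str:
    """Classify a styryl dye working concentration (in nM) against the stated ranges.

    Assumptions (not from the text): 5000 nM is taken as the outer upper
    bound, the 50-250 range is read as nM, and bounds are inclusive.
    """
    if conc_nm < 0:
        raise ValueError("concentration must be non-negative")
    # Check the narrowest range first, because the ranges are nested.
    if 50 <= conc_nm <= 250:
        return "particularly preferred"
    if 50 <= conc_nm <= 1000:
        return "preferred"
    if 25 <= conc_nm <= 5000:
        return "useful"
    return "outside stated range"


if __name__ == "__main__":
    for conc in (10, 30, 100, 800, 3000, 6000):
        print(conc, "nM:", dye_concentration_tier(conc))
```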

Exocytosis assays based on fluorescent dye release rely on release of dye that is taken up passively or actively endocytosed by the cell. Release of dyes taken up by a cell results in decreased cellular fluorescence and presence of the dye in the cellular medium, thus providing two bases for measuring dye release. For example, styryl dye taken up into cells by endocytosis is released into the cellular medium by exocytosis, resulting in decreased cellular fluorescence and presence of the dye in the medium. Another dye release assay uses low pH dyes, such as acridine orange, LYSOTRACKER™ red, LYSOTRACKER™ green, and LYSOTRACKER™ blue (Molecular Probes, supra), which stain exocytic granules when the dye is internalized by the cell.

Alternatively, the exocytosis assay relies on release of molecules contained in the granule. In one aspect, these may be proteins or detectable biomolecules, especially enzymes such as proteases and glycosidases, released as part of the exocytic process. Many enzymes are inactive within the granule because of the low pH in the vesicle but become activated when exposed to the extracellular medium at physiological pH. Preferred granule enzymes include but are not limited to chymase, tryptase, arylsulfatase A, β-hexosaminidase, β-D-galactosidase, and the like. Enzyme activities are measurable using chromogenic or fluorogenic substrates. The generation of a signal via cleavage of a chromogenic or fluorogenic substrate is related to the amount of enzyme present, and is thus a measure of exocytosis. If the exocytosis is inducible, an inducing signal is used.

The fluorogenic substrate may be a substrate that precipitates upon action by the enzyme. For example, substrates for glucuronidase, such as ELF-97 glucuronide, precipitate through action of the released enzyme. Other precipitating substrates are well known in the art and commercially available (see for example, Molecular Probes, supra, particularly Chapter 10, more particularly Section 2 of Chapter 10, and referenced related chapters). When the granule specific proteins comprise biological mediators released during exocytosis, such as serotonin, histamine, heparin, hormones, etc., these granule proteins may be identified using specific antibodies.

Preferential staining of exocytic granules when vesicles fuse with the cell membrane provides an additional assay for measuring exocytosis. Annexin V, which binds the phospholipid phosphatidylserine in a divalent ion dependent manner, specifically binds to exocytic granules present on the cell surface but fails to bind internally localized exocytic granules. This property of Annexin provides a basis for determining exocytosis by the level of Annexin bound to cells. Cells show an increase in Annexin binding in proportion to the time and intensity of the exocytic response. Annexin is detectable directly by use of fluorescently labeled Annexin derivatives (e.g., FITC, TRITC, AMCA, APC, or Cy-5 fluorescent labels), or indirectly by use of Annexin modified with a primary label (e.g., biotin), which is detected using a labeled secondary agent that binds to the primary label (e.g., fluorescently labeled avidin). In general, changes of 25% from baseline are preferred, with at least about 50% being more preferred, at least about 100% being particularly preferred and at least about 500% being especially preferred. Baseline as used herein means the amount of Annexin binding as compared to binding under a second state or in a different cell.
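
The threshold tiers for Annexin binding can be illustrated with a short calculation. The following Python fragment is a sketch under stated assumptions: the function names are hypothetical, signals are arbitrary fluorescence units, the percent change is computed as (signal - baseline) / baseline, and the magnitude of the change is used so that both increases and decreases count, which the text does not specify.

```python
def percent_change(signal: float, baseline: float) -> float:
    """Percent change of an Annexin binding signal relative to baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (signal - baseline) / baseline


# Thresholds taken from the text, checked from most to least stringent.
TIERS = (
    (500.0, "especially preferred"),
    (100.0, "particularly preferred"),
    (50.0, "more preferred"),
    (25.0, "preferred"),
)


def classify_change(signal: float, baseline: float) -> str:
    """Label a change by the most stringent tier it reaches."""
    # Assumption: the magnitude of the change is what matters.
    magnitude = abs(percent_change(signal, baseline))
    for threshold, label in TIERS:
        if magnitude >= threshold:
            return label
    return "below threshold"


if __name__ == "__main__":
    baseline = 100.0
    for signal in (110.0, 125.0, 150.0, 200.0, 600.0):
        print(signal, classify_change(signal, baseline))
```

For example, a signal of 150 units against a baseline of 100 units is a 50% change and would fall in the "more preferred" tier under these assumptions.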

Alternatively, in a preferred embodiment the exocytosis indicators are engineered into the cells. For example, recombinant proteins comprising fusion proteins of a granule specific, or a secreted protein, and a reporter molecule are expressed in a cell by transforming or transfecting the cells with a fusion nucleic acid encoding the fusion protein. This is generally done as is known in the art, and will depend on the cell type. Generally, for mammalian cells, retroviral vectors, including those of the present invention, are preferred for delivery of the fusion nucleic acid. Preferred reporter molecules include, but are not limited to, Aequorea victoria GFP, Renilla muelleri GFP, Renilla reniformis GFP, Ptilosarcus GFP, BFP, YFP, and enzymes including luciferases (e.g., Renilla, firefly etc.) and β-galactosidases. Presence of the granule protein-reporter fusion construct on the cell surface or presence of secreted protein-reporter fusion construct in the medium indicates the level of exocytosis in the cells. In one preferred embodiment, cells are transformed with vectors expressing a fusion protein comprising a granule specific protein, such as synaptobrevin (VAMP) or synaptotagmin, fused to a GFP reporter molecule. The cells are monitored for localization of the fusion protein to the cell membrane. By incorporating a separation sequence and a second gene of interest comprising a distinguishable reporter or selection gene, cells expressing the fusion protein are readily selected. Moreover, the second gene of interest provides an internal standard to measure level of fusion protein content in the cell. Candidate agents, for example candidate nucleic acids and candidate peptides, introduced into these transformed cells are tested for their ability to affect distribution of the fusion protein. Alternatively, the fusion protein is detected, directly or indirectly, using an antibody.

In another preferred embodiment, the methods are used to examine cell cycle regulation. Complicated regulatory pathways control cell cycle progression. These regulatory molecules include, among others, cellular receptors, cyclins, cyclin dependent kinases, cyclin dependent kinase inhibitors, cell division cycle phosphatases (CDC), ubiquitin ligases and ubiquitin mediated proteases, tumor suppressor proteins (e.g., cell cycle checkpoint regulators), and transcription factors. Cell cycle regulation is implicated in tumor formation and immune system regulation. The compositions of the present invention are used to identify candidate agents producing an altered cell cycle phenotype, such as activation or suppression of cell cycle checkpoint. In one aspect, the candidate agents are fusion nucleic acids expressing candidate peptides fused to rGFP or pGFP. These candidate agents are introduced into cells in the form of vectors, preferably retroviral vectors when mammalian cells are used. In another aspect, the candidate agents are nucleic acids, peptides, cDNAs, and genomic DNAs expressed as a gene of interest. When these candidate agents comprise peptides and proteins, the fusion nucleic acid may further comprise a separation sequence and a rGFP or pGFP to produce separate proteins and to monitor expression of the candidate agent.

In another preferred embodiment, the fusion nucleic acids of the present invention are used to express cell cycle regulators or express mutants of cell cycle regulatory proteins which produce a cell cycle phenotype in the cells. In one aspect, the fusion nucleic acids may comprise a gene of interest comprising a cell cycle regulator, which induces a cell cycle phenotype when expressed. A separation sequence and a reporter gene, such as rGFP or pGFP, allow monitoring expression of the gene of interest. When the candidate agent comprises rGFP or pGFP fusion proteins or when the candidate agent is expressed from a fusion nucleic acid comprising a first gene of interest, a separation sequence, and a second gene of interest comprising rGFP or pGFP, a distinguishable reporter gene (e.g., blue fluorescent protein) is used to monitor expression of the cell cycle regulator. Candidate agents are then introduced into the cells to identify those agents altering the induced cell cycle phenotype.

The cell cycle may be examined by a variety of methods well known to those skilled in the art (see for example, US 2001/0003042, which is expressly incorporated by reference). The assays permit determining whether cell cycle arrest occurs at a particular cell cycle stage (i.e., cell proliferation assays) and at a specific cell stage (i.e., cell phase assays). By measuring or assaying one or more of these parameters, it is possible to detect alterations in cell cycle regulation and also alteration of different steps of the cell cycle regulatory pathway. “Alteration” and “modulation” as used herein can include both increases and decreases in the cellular parameter being measured. In a preferred embodiment, the alteration results in a change in the cell cycle of a cell, i.e., a proliferating cell arrests in any one of the phases, or an arrested cell moves out of its arrested phase to progress into the cell cycle, as compared to another cell or the same cell under different conditions. Alternatively, the progress of a cell through any particular phase may be altered; that is, there may be an acceleration or delay in the time for the cell to move through a particular growth phase.

In a preferred embodiment a proliferation assay is used. By “proliferation assay” herein is meant an assay that allows determining whether a cell population is proliferating, i.e., replicating or not replicating. In one preferred embodiment, the proliferation assay is a dye inclusion assay. A dye inclusion assay relies on uptake of dye by cells and subsequent dilution of the dye by succeeding rounds of cell division. Generally, the introduction of dye may be done in several ways. In one approach, the dye cannot passively enter the cells (e.g., the dye is charged), and the cells are induced to take up the dye. Alternatively, the dye passively enters the cells and is subsequently modified to limit diffusion out of the cells. For example, Molecular Probes CellTracker dyes comprise chloromethyl derivatives of fluorescent compounds that freely diffuse into cells and are subsequently modified by glutathione S-transferase, which renders the dyes membrane impermeant. Suitable inclusion dyes include, but are not limited to, CellTracker dyes including, but not limited to CellTracker Yellow-Green, CellTracker Green, CellTracker Orange, PKH26 (Sigma), and others well known in the art (see Molecular Probes Handbook, supra).
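The dilution principle described above can be sketched numerically. The following Python snippet is illustrative only and assumes that each cell division halves the dye content per cell and that dye loss by other routes is negligible; the function name and the optional background correction are hypothetical and are not taken from the specification.

```python
import math


def estimated_divisions(initial_fluorescence, current_fluorescence, background=0.0):
    """Estimate rounds of division from dilution of a retained dye.

    Assumption: dye is split evenly between daughter cells, so the signal
    per cell halves with each division and the number of divisions is
    log2(initial / current) after background subtraction.
    """
    start = initial_fluorescence - background
    now = current_fluorescence - background
    if start <= 0 or now <= 0:
        raise ValueError("fluorescence must exceed background")
    return math.log2(start / now)


if __name__ == "__main__":
    # Hypothetical per-cell fluorescence readings (arbitrary units).
    print(estimated_divisions(1000.0, 125.0))
```

A population whose signal remains near the starting value is, under these assumptions, not dividing, which is the readout the proliferation assay relies on.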

In another preferred embodiment, the proliferation assay is an antimetabolite assay. In general, antimetabolite assays are most useful when agents causing cell cycle arrest at G1 or G2 resting phase are desired. In an antimetabolite assay, the use of a toxic metabolite that will kill dividing cells will result in survival of only those cells that are not dividing. Suitable antimetabolites include, but are not limited to, standard chemotherapeutic agents such as methotrexate, cisplatin, taxol, hydroxyurea, and nucleotide analogs (e.g., AraC). In addition, antimetabolite assays may include the use of genes that cause cell death upon expression.

The concentration at which the antimetabolite is added will depend on the toxicity of the particular antimetabolite, and will be determined as is known in the art. The antimetabolite is added and the cells are generally incubated for some period of time; again, the exact period of time will depend on the characteristics and identity of the antimetabolite as well as the cell cycle time of the particular cell population. Generally, the incubation time is sufficient for at least one cell division. In a preferred embodiment, at least one proliferation assay is done, with more than one being preferred.

In another preferred embodiment, either after or simultaneously with one or more of the proliferation assays outlined above, at least one cell phase assay is done. By “cell phase” assay herein is meant an assay that determines at which cell phase cell cycle arrest takes place, i.e., M, G1, S, or G2.

In one preferred embodiment, the cell phase assay is a DNA binding assay. When inside the cell, the dye binds to DNA, generally by intercalation, although in some cases, the dyes can be either major or minor groove binding compounds. Thus, the amount of dye is directly correlated to the amount of DNA in the cell, which varies with cell phase; G2 and M phase cells have twice the DNA content of G1 phase cells, and S phase cells have an intermediate amount. Suitable DNA binding dyes include, but are not limited to, Hoechst 33342 and 33258, acridine orange, 7AAD, LDS 751, DAPI, and SYTO 16 (see Molecular Probes Handbook, supra, Chapters 8 and 16 in particular).

In general, the DNA binding dyes are added in concentrations ranging from about 1 μg/ml to about 5 μg/ml. The dyes are added to the cells and allowed to incubate for some period of time; the length of time will depend in part on the dye chosen. In one embodiment, measurements are taken immediately after addition of the dye. The cells are then sorted as outlined below, to create populations of cells that contain different amounts of dye, and thus different amounts of DNA; in this way, cells that are replicating are separated from those that are not. As will be appreciated by those in the art, in some cases, for example when screening for anti-proliferation agents, cells with the least fluorescence (and thus a single copy of the genome) can be separated from those that are replicating since the replicating cells contain more than a single genome of DNA. Alterations are determined by measuring the fluorescence at either different time points or in different cell populations, and comparing the determinations to one another or to standards.
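The relationship between DNA dye signal and cell phase described above (G2 and M phase cells carry twice the DNA of G1 phase cells, with S phase cells in between) can be sketched as a simple gating rule. The Python below is a minimal illustration only; the tolerance windows, the "sub-G1" and ">G2/M" labels, and the function name are assumptions introduced here, and actual gates would be set empirically on the instrument.

```python
def classify_phase(dna_signal, g1_peak, tolerance=0.15):
    """Assign a cell phase from DNA dye fluorescence relative to the G1 peak.

    Assumptions: fluorescence is proportional to DNA content, the G1 peak
    position is known, and a fractional tolerance window (hypothetical
    default of 15%) is acceptable around the 1x and 2x positions.
    """
    if g1_peak <= 0:
        raise ValueError("g1_peak must be positive")
    ratio = dna_signal / g1_peak
    if ratio < 1 - tolerance:
        return "sub-G1"      # less than one genome equivalent
    if ratio <= 1 + tolerance:
        return "G1"          # single copy of the genome
    if ratio < 2 - 2 * tolerance:
        return "S"           # intermediate DNA content
    if ratio <= 2 + 2 * tolerance:
        return "G2/M"        # twice the G1 DNA content
    return ">G2/M"


if __name__ == "__main__":
    # Hypothetical fluorescence values with a G1 peak at 100 arbitrary units.
    for signal in (50, 100, 150, 200):
        print(signal, classify_phase(signal, g1_peak=100))
```

Note that DNA content alone does not distinguish G2 from M phase, which is why the sketch reports them as a single class.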

In a preferred embodiment, the cell phase assay is a cyclin destruction assay. In this embodiment, prior to screening, and generally prior to the introduction of a candidate bioactive agent, a fusion nucleic acid is introduced to the cells. The fusion nucleic acid expresses a fusion protein comprising a cyclin destruction box and a detectable molecule. “Cyclin destruction boxes” are known in the art and are sequences that cause destruction via the ubiquitination pathway of destruction box containing proteins during particular cell phases. That is, for example, G1 cyclins may be stable during G1 phase but degraded during S phase due to the presence of a G1 cyclin destruction box. Thus, by linking a cyclin destruction box to a detectable molecule, for example green fluorescent protein, the presence or absence of the detectable molecule can serve to identify the cell phase of the cell population. In a preferred embodiment, multiple boxes are used, preferably each fused to distinguishable fluorescent proteins, such that detection of the cell phase can occur.

A number of cyclin destruction boxes are known in the art. For example, cyclin A has a destruction box comprising the sequence RTVLGVIGD (SEQ ID NO: 75) while the destruction box of cyclin B1 comprises the sequence RTALGDIGN (SEQ ID NO: 65) (Glotzer et al. (1991) Nature 349: 132–138). Other destruction boxes are known as well: YMTVSIIDRFMQDSCVPKKMLQLVGVT (SEQ ID NO: 76) (rat cyclin B); KFRLLQETMYMTVSIIDRFMQNSCVPKK (SEQ ID NO: 77); RAILIDWLIQVQMKFRLLQETMYMTVS (SEQ ID NO: 78) (mouse cyclin B1); DRFLQAQLVCRKKLQWGITALLLASK (SEQ ID NO: 79) (mouse cyclin B2); and MSVLRGKLQLVGTAAMLL (SEQ ID NO: 80) (mouse cyclin A2). These cyclin destruction boxes are operably linked to nucleic acid encoding a detectable molecule to generate fusion proteins, as described above.

In a preferred embodiment, the cell cycle analysis further comprises a cell viability assay to ensure that a lack of cellular change is due to experimental conditions. Various suitable viability assays include, but are not limited to, light scattering, viability dye staining, and exclusion dye staining.

In a preferred embodiment, the viability assay is a light scattering assay, which is well known in the art. Cells have particular forward and side (90 degree) scatter properties representing the size, shape and granule content of the cells. Briefly, two features affect the scatter properties: DNA condensation in dead and dying cells alters the side scatter, and membrane blebbing alters the forward scatter. Changes in the intensity of light scattering or the cell refractive index indicate alterations in viability. In a preferred embodiment, evaluating a live cell population of a particular cell type provides characteristic forward and side scatter properties for comparison to other cell populations.

In another preferred embodiment, the viability assay uses a viability dye. These dyes stain dead or dying cells but not growing cells. For example, Annexin V displays divalent ion dependent binding to the phospholipid phosphatidylserine, whose presence on the cell surface is an early signal of apoptosis. Other suitable viability dyes include, but are not limited to, ethidium homodimer-1, DEAD Red, propidium iodide, SYTOX Green, etc., and others known in the art (see Molecular Probes, supra “Apoptosis Assay,” pg 285, and Chapter 16, hereby incorporated by reference). Preferably, the viability dye concentration used is about 100 ng/ml to about 500 ng/ml, and more preferably, from about 500 ng/ml to about 1 μg/ml, most preferably about 500 ng/ml to about 1 μg/ml, and from about 1 μg/ml to about 5 μg/ml being particularly preferred. In a preferred embodiment, the dye is directly labeled. For example, Annexin may be labeled with a fluorophore such as fluorescein isothiocyanate (FITC), Alexa dyes, TRITC, AMCA, APC, tri-color Cy-5, and others known in the art. In an alternative preferred embodiment, the viability dye is labeled with a first label (e.g., hapten or biotin), and a secondary fluorescent label is used to detect the first label.

In another preferred embodiment, the viability assay is a dye exclusion assay. Exclusion dyes rely on exclusion of the dye from living cells but entry into permeable dead or dying cells. Generally, the exclusion dyes bind to DNA and fluoresce, but fluoresce poorly when not bound to DNA. Alternatively, exclusion dyes are detected using a secondary label. Preferred exclusion dyes include, but are not limited to ethidium bromide, ethidium bromide homodimer-1, propidium iodide, SYTOX Green, calcein AM, BCECF AM, fluorescein diacetate, TOTO, and TO-PRO (see Molecular Probes, supra) and others known in the art. These dyes are added to cells at a concentration of about 100 ng/ml to about 500 ng/ml, more preferably, about 500 ng/ml to about 1 μg/ml, and most preferably, from about 0.1 μg/ml to about 5 μg/ml, with about 0.5 μg/ml being particularly preferred. In addition, other cell viability assays are used, including assays that measure extracellular (e.g., proteases) or intracellular (e.g., mitochondrial enzymes) enzymes of live and dead cells.

In a preferred embodiment, at least one cell viability assay is run, with at least two different cell viability assays being preferred. When only one viability assay is run, a preferred embodiment uses light scattering assays (both forward and side scatter). When two viability assays are run, preferred embodiments use light scattering and dye exclusion or light scattering and viability dye staining. In some cases, all three assays are used.

Thus, in a preferred embodiment, cell cycle assays comprise sorting cells in a FACS by assaying several different cellular parameters, including, but not limited to, cell viability, cell proliferation, cell phase, and appropriate combinations thereof. The results from one or more of the assays are compared to cells not exposed to the candidate bioactive agent.

In the present invention, assays for other cellular parameters are combined with the cell cycle assay. These include cellular parameters of cell shape, redox state, DNA content, nucleic acid sequence, chromatin structure, RNA content, total protein, antigens, lipids, surface proteins, intracellular receptors, oxidative metabolism, DNA synthesis, degradation, intracellular pH, etc. In a preferred embodiment, each of these measurements is determined simultaneously or sequentially using FACS (i.e., multiparameter FACS). By using more than one parameter to detect the cell cycle, background is reduced and specificity is increased. In one aspect, the cells are sorted at high speeds, for example greater than about 5,000 sorting events/s, with greater than about 10,000 sorting events/s being preferred, and greater than about 25,000 sorting events/s being particularly preferred, with speeds of greater than about 50,000 to 100,000 sorting events/s being especially preferred.
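As a rough illustration of what the sorting rates quoted above imply for throughput, the following Python sketch computes the time needed to pass a given number of cells through the sorter. It assumes an idealized constant event rate with no instrument dead time or aborted events; the function name and the example population size of 10^8 cells are hypothetical.

```python
def sort_time_seconds(cell_count, events_per_second):
    """Idealized time to analyze cell_count cells at a constant event rate."""
    if events_per_second <= 0:
        raise ValueError("events_per_second must be positive")
    return cell_count / events_per_second


if __name__ == "__main__":
    cells = 1e8  # hypothetical number of cells to be screened
    for rate in (5_000, 10_000, 25_000, 50_000, 100_000):
        hours = sort_time_seconds(cells, rate) / 3600
        print(f"{rate:>7} events/s -> {hours:.2f} h")
```

Under these simplifying assumptions, a tenfold increase in sorting rate shortens the run tenfold, which is one reason the higher rates are preferred for screening large libraries.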

In another preferred embodiment, the present methods are useful in cancer applications. The ability to rapidly and specifically kill tumor cells is a cornerstone of cancer chemotherapy. In general, using the methods of the present invention, the fusion nucleic acids of the present invention can be introduced into any tumor cell, primary or cultured, to identify bioactive agents that can induce apoptosis, cell death, loss of cell division, or decreased cell growth. The methods of the present invention can be combined with other cancer therapeutics (e.g., drugs or radiation) to sensitize the cells, and thus induce rapid and specific apoptosis, cell death or decreased growth after exposure to a secondary agent. Similarly, the present invention may be used in conjunction with known cancer therapeutics to screen for agonists to make the therapeutic treatments more effective or less toxic. This is particularly preferred when the chemotherapeutic agent is difficult or expensive to produce, such as taxol.

In a preferred embodiment, the present invention is used to identify candidate agents that alter the transformed phenotype of cancer cells. It is well known that oncogenes such as v-Abl, v-Src, v-Ras, and others induce a transformed phenotype leading to abnormal cell growth when transfected into certain cell types. Loss of growth control is also a major problem associated with metastasis of transformed cells. Thus, in a preferred embodiment, susceptible, non-transformed cells can be transformed with these oncogenes, and then candidate agents introduced into these cells to select for bioactive agents which reverse or correct the transformed state.

One of the identifying features of oncogenic transformation is a loss of contact inhibition together with the ability to grow in soft agar. This characteristic provides one method for identifying candidate agents that alter the transformed phenotype of tumor cells. In this assay, transforming viruses are constructed containing v-Abl, v-Src, or v-Ras, a separation sequence, and a puromycin selection gene. Following introduction of the viral constructs into NIH 3T3 cells, the cells are subjected to puromycin selection. The NIH 3T3 cells hypertransform and detach from the plate, which allows their removal by washing with fresh medium. This feature can serve as a basis for a screen since cells that express a bioactive agent altering this phenotype will remain attached to the plate and form colonies.

Similarly, the growth and/or spread of certain tumor cell types is enhanced by stimulatory responses from growth factors and cytokines (e.g., PDGF, EGF, Heregulin, and others), which bind to receptors on the surfaces of specific tumors. In a preferred embodiment, the present invention is used to identify candidate agents capable of blocking the ability of growth factors or cytokines to stimulate the tumor cell. This screen comprises introducing the fusion nucleic acids expressing candidate agents followed by selecting for agents that block the binding, signaling, phenotypic and/or functional responses of these tumor cells to the subject growth factor or cytokine.

Similarly, the spread of cancer cells by tumor cell invasion or metastasis presents a significant problem in success of cancer therapies. The ability to restrict or inhibit the migration of specific tumor cells would provide a significant advance in the therapy of cancer. Tumor cells known to have high metastatic potential can have candidate agents introduced into them, and agents selected that inhibit migrative or invasive activity of the tumor cells. The present invention provides compositions for following the migration of cells, for example by expressing rGFP or pGFP in cells and examining invasive activity. Alternatively, the rGFP or pGFP fusion proteins are used to monitor cellular components involved in cell migration, such as cellular actin or focal adhesion proteins. Candidate agents may be introduced into these cells to identify agents that affect the invasive or metastatic properties of the tumor cells. These and other particular applications of inhibition of metastatic phenotype could allow specific inhibition of metastasis. This may include, for example, candidate agents that upregulate metastasis suppressor gene NM23, which codes for a nucleoside diphosphate kinase. Peptides that counteract oncogenes, such as v-Mos, v-Raf, a-Raf, v-Src, v-Fes, and v-FMS, or inhibit the release or activity of matrix metalloproteinases would also act as anti-metastatic agents.

In a preferred embodiment, the present invention finds use in immunologic and inflammatory applications. Selective regulation of T lymphocytes is a desired goal for modulating immune mediated diseases. Thus, candidate agents of the present invention can be introduced into specific T-cell subsets (TH1, TH2, CD4+, CD8+, etc.) and examined for characteristic responses, for example cytokine generation, cytotoxicity, proliferation, and others. Agents can be selected that increase or decrease the known T-cell physiologic response. For monitoring these responses, the present invention may also be used as markers of physiologic response, for example by operably linking rGFP or pGFP to promoters of cytokines that are regulated as part of the immune response. Candidate agents that affect regulation of the cytokine promoters can be screened on the basis of expression of rGFP or pGFP. These approaches will be useful in any number of conditions, including: 1) autoimmune disease states where inducing a tolerant state is desirable; 2) allergic diseases where decreasing the stimulation of IgE producing cells is desirable (e.g., blocking release from T-cell subsets of specific B-cell stimulating cytokines that induce switch to IgE production); 3) transplantation of organs where it is desirable to induce selective immunosuppression or prolong functioning of the transplanted organ; 4) in lymphoproliferative states for inhibiting growth or to sensitize a specific T-cell tumor to chemotherapy and/or radiation; 5) in tumor surveillance for inhibiting the elimination of cytotoxic T-cells via Fas ligand bearing tumor cells; and 6) in T-cell mediated autoimmune or inflammatory diseases such as rheumatoid arthritis, multiple sclerosis, inflammatory bowel disease, myasthenia gravis, systemic lupus erythematosus, early onset diabetes, etc.

In a preferred embodiment, the present invention is applicable in selective modulation of B-cell response. Activation of B-cells initiates various facets of humoral immunity, including immunoglobulin synthesis and antigen presentation by B-cells. Activation is mediated by engagement of the B-cell receptor (BCR), for example by binding of anti-IgM F(ab′) fragments. Activation induces several signal transduction pathways leading to various B cell responses, including apoptosis, expression of cell surface marker CD69, and modulation of IgH promoter activity. Thus, in a preferred embodiment, candidate agents comprising the fusion nucleic acids of the present invention are introduced into appropriate B-cell lines, such as Ramos Human B-cell lines, M12.4, etc., to identify candidate agents affecting the signaling pathways activated by B-cell receptor engagement. The assay may comprise determining the level of CD69 cell surface marker (e.g., by fluorescently labeled anti-CD69 antibody and FACS selection of cells expressing high levels of CD69) or inhibition of apoptotic pathway (i.e., inhibition of cell death) following receptor activation. In one aspect, the candidate agents may be fusion nucleic acids expressing candidate peptides fused to rGFP or pGFP. These candidate agents are introduced into cells in the form of vectors, preferably retroviral vectors when mammalian cells are used. In another aspect, the candidate agents are nucleic acids, peptides, cDNAs, and genomic DNAs expressed as a gene of interest using the fusion nucleic acids described herein.

In another aspect, the present invention finds use as indicators of B-cell receptor mediated signal transduction. An IgH promoter may be operably linked to a rGFP or pGFP, which allows monitoring of BCR activation by providing a measure of IgH promoter activity. For example, the promoter reporter construct may comprise a fusion nucleic acid comprising a first gene of interest comprising HBEGF, a Type 2A separation sequence, and a second gene of interest comprising rGFP or pGFP fused to a PEST sequence. Candidate agents are introduced into cells carrying this construct to identify agents that activate or suppress BCR mediated signal transduction, as reflected in changes in IgH promoter activity. Cells that survive exposure to diphtheria toxin and/or have low levels of GFP expression will have low IgH promoter activity. Expression of the candidate agents may be under the control of an inducible promoter, such as tetP, thus limiting any detrimental effect of constitutively expressing candidate agents.

In a preferred embodiment, the present invention is used in infectious disease applications. Viral pathogens can produce chronic or acute infections leading to severe, disabling health effects, and death. Pathogenic viruses, such as human immunodeficiency virus, cytomegalovirus, leukemia virus, hepatitis virus, herpes virus, among others, are epidemic throughout the world. There is a need for understanding the infection process and identifying agents affecting propagation of the virus. In a preferred embodiment, the present invention is used to follow and track virus infection of cells. This is done in a number of ways. In one aspect, rGFP or pGFP are fused to a protein synthesized by the pathogenic organism. For viruses, fusions may be made to viral capsid or envelope proteins since these proteins can tolerate substantial modifications and still be incorporated into the viral particle. The fusions allow monitoring of infected cells, tracking of synthesized viral particles in the cell, and determining the presence of viral particles extruded from the cell. Other viral structural proteins suitable for fusions include the tegument proteins, which form a structure generally located between the capsid and the envelope. Alternatively, the fusion nucleic acid comprising the rGFP or pGFP gene is inserted into the viral genome, for example by homologous recombination, such that expression is driven by a viral promoter. Viral infection of cells results in expression of the reporter molecule, thus allowing monitoring of the infection process.

Analogously, cell lines are constructed in which a viral promoter is operably linked to a fusion nucleic acid comprising rGFP or pGFP. Upon infection of the cell by a virus, the viral promoter is activated resulting in fluorescent reporter gene expression. Consequently, expression of the GFP provides a measure of viral infection. A variety of viral promoters may be used. These may include immediate early gene promoter of many viruses or the viral promoters present on the long terminal repeats of pathogenic retroviruses (e.g., HIV). Cellular promoters modulated by viral infection may also be used.

The modified viruses and cells containing the described fusion nucleic acids are then used to identify candidate agents capable of affecting the infection process, for example, agents capable of inhibiting viral synthesis. Candidate agents are contacted with the cells, and the cells are infected with the modified viruses. Candidate agents that lower the amount of virus produced or affect the promoters regulated by the infection process can be identified. When the candidate agents are part of a fusion nucleic acid comprising rGFP or pGFP, the reporter gene selected for tracking and examining the infection process is a reporter distinguishable from rGFP or pGFP.

Many cellular pathogens are known to exist intracellularly. For example, mycobacteria, rickettsia, salmonella, pneumocystis, yersinia, leishmania, Trypanosoma cruzi, and the like can persist and replicate within cells such as macrophages and lymphocytes. In a manner similar to tagging pathogenic viruses described above, the fusion nucleic acids comprising rGFP or pGFP are used to mark or tag the pathogenic organism. As with viruses, marking or tagging these non-viral entities may be done in a number of ways. In one aspect, a fusion nucleic acid comprising a promoter active within the organism, such as a promoter that regulates expression of a protein required for infection, is operably linked to fusion nucleic acids comprising rGFP or pGFP. These constructs are inserted into the organism by various methods, for example by homologous recombination. Alternatively, the expression vectors may be maintained extrachromosomally by expression of a selection gene followed by treatment of the organism under selection conditions.

These marked or tagged organisms are used to infect appropriate cells or host organisms. The infection process may be tracked by monitoring expression of the reporter gene. Cells harboring the marked pathogens are readily identified. Candidate agents are contacted with these cells to identify agents that affect the infection process. Bioactive candidate agents may be selected for their ability to eliminate or kill the intracellular organisms, similar to the antibiotic peptide magainin. Other assays include selecting for agents that prevent initial infection, confer resistance to infection of the host cell, inhibit replication of the pathogen, or increase susceptibility of infected cells for destruction by host defense mechanisms (e.g., immune response).

For example, some viruses use cellular receptors and receptor complexes to bind and enter cells. For instance, HIV binds CD4 complexes, coronaviruses bind CD13, and measles virus binds CD44 receptors to infect cells. It is desirable to identify agents that block viral infection in cells permissive for viral infection. In a specific example, it is known that entry of HIV-1 into cells requires CD4 and a co-receptor, which can be one of several seven transmembrane G-protein coupled receptors. In the case of macrophages, the co-receptor required for HIV infection is CCR-5. Individuals homozygous for a mutant allele of CCR-5 are resistant to HIV infection, and natural ligands of CCR-5, for example, CC chemokines RANTES, MIP1a and MIP1b can confer CD8⁺ mediated resistance to HIV infection. Thus, agents that inhibit interaction between the CD4/CCR-5 receptor complex and HIV are desirable. In a preferred embodiment, the agents are inserted into the membrane and displayed extracellularly. In one aspect, a library of candidate peptides may comprise an epitope tagged, glycine-serine tethered peptide library, which is a library of cyclized peptides of the general sequence CXXXXXXXXXXC or C-(X)_(n)-C, where C is cysteine and X is any amino acid. Cells expressing the CD4/CCR-5 complex are contacted with a library of the candidate peptides, and infected with the viruses described above. Cells that are not infected with viruses are identified by FACS and the candidate agent conferring resistance to infection identified. These agents are then further assayed for their ability to inhibit viral infection.
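The general sequence C-(X)_(n)-C given above can be illustrated with a short Python sketch that enumerates random library members in silico. This is an illustration only: the function names, the uniform sampling over the twenty standard amino acids, and the fixed random seed are assumptions introduced here, and the epitope tag and glycine-serine tether are omitted because their sequences are not specified in the passage.

```python
import random

# One-letter codes for the twenty standard amino acids ("X is any amino acid").
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_cyclic_peptide(n, rng):
    """Return one peptide of the form C-(X)n-C with n random residues."""
    core = "".join(rng.choice(AMINO_ACIDS) for _ in range(n))
    return "C" + core + "C"


def build_library(size, n=10, seed=0):
    """Build a list of `size` random C-(X)n-C peptides (hypothetical helper)."""
    rng = random.Random(seed)
    return [random_cyclic_peptide(n, rng) for _ in range(size)]


if __name__ == "__main__":
    for peptide in build_library(size=5, n=10):
        print(peptide)
```

With n = 10 each member matches the CXXXXXXXXXXC form, and the flanking cysteines are the residues that would permit cyclization through a disulfide bond.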

In another preferred embodiment, the present invention is used to find candidate agents affecting separation sequences used in various biological processes, including, but not limited to, cell death, viral pathogenesis, expression of cellular genes resulting in cell disease states, processing of cellular proteins, mechanism of action of bacterial toxins (e.g., botulinum toxin), and the like. In one aspect, when the separation sequences are protease recognition sequences, the fusion nucleic acids of the present invention are used to express substrates to detect protease activity, as described above. The substrates comprise fusions of protease recognition sequences to rGFP or pGFP. In another embodiment, the protease substrates are based on rGFP or pGFP linked to another fluorescent protein via a protease recognition sequence to generate substrates capable of undergoing FRET. Preferably, the substrates are codon optimized for the organism in which the substrates are expressed to maximize the signal. The protease site sequences include, among others, those recognized by caspase proteases; viral proteases involved in polyprotein processing, for example the HIV protease; proteases of bacterial toxins (e.g., botulinum toxin); proteases that process cellular proteins, especially those related to disease states (e.g., secretase processing of β-amyloid and Notch proteins; cathepsins, etc.); proteases regulating cell adhesion (e.g., metalloproteases associated with extracellular matrix); proteases involved in blood coagulation, inflammation and wound healing; tumor cell associated proteases; and the like. Preferably, the protease substrates are codon optimized for efficient expression in the subject organism.

In a preferred embodiment, the present invention finds applications in the study of drug resistance or drug toxicity mechanisms. Development of drug resistance in a variety of cell types limits the effectiveness of drug therapies. For example, multi-drug resistance in tumor cells leads to selection of drug resistant tumor cells, which leads to relapse, morbidity, and increased mortality in cancer patients. Thus, in a preferred embodiment, fusion nucleic acids expressing candidate agents are introduced into drug resistant cells, either primary or cultured. Agents are identified that confer drug sensitivity when cells are exposed to a drug or combinations of drugs. Cells may be selected based on onset of apoptosis, changes in membrane permeability, release of intracellular ions, and release of fluorescent markers. Cells in which multi-drug resistance involves transporters can be preloaded with fluorescent transporter substrates, and selection carried out for candidate agents which block normal efflux of fluorescent drugs from these cells. Screening of candidate agents affecting drug resistance is well suited for poorly characterized mechanisms of resistance. Identifying candidate agents that increase susceptibility of these cells to drugs may provide a basis for identifying the cellular targets and for rational design of peptide inhibitors of drug resistance pathways.

In another aspect, the present invention is used to identify cellular targets that regulate synthesis of drug resistance proteins at the transcriptional or translational levels. In a preferred embodiment, promoters of drug resistance proteins, such as multi-drug resistance transporters, are operably linked to fusion nucleic acids comprising rGFP or pGFP. Candidate agents, such as a library of small molecules, random peptides, cDNAs, or genomic DNAs, are introduced into cells and screened for their ability to regulate drug resistance protein gene transcription. Candidate agents that activate or inhibit transcription are identified and used to design other inhibitors or identify the cellular targets of the candidate agents.

In another preferred embodiment, the fusion nucleic acids of the present invention are used to confer a drug resistance phenotype in cells by expressing drug resistance proteins, for example multi-drug resistance transporters (e.g., P-glycoproteins). The drug resistance protein may be expressed from a fusion nucleic acid comprising a first gene of interest, a separation sequence, and a second gene of interest, where at least one of the genes of interest is the drug resistance gene and the other gene of interest is a reporter gene, such as rGFP or pGFP, for monitoring expression. Cells expressing these fusion nucleic acids are contacted with candidate agents and screened for their ability to reduce drug resistance (i.e., increase drug sensitivity).

In a preferred embodiment, the present invention is useful in identifying candidate agents that bind specific cells, tissues and organs. Cells expressing libraries of candidate agents comprising rGFP or pGFP are contacted with cells or introduced into an organism. Candidate agents that bind to specific cells are selected. These bioactive candidate agents are useful for targeting coupled antibodies, enzymes, drugs, imaging agents, and the like to particular cells or organs.

In a preferred embodiment, the present invention provides compositions and methods utilizing rGFP and/or pGFP and a chip device comprising integrated photodetectors at individual loci. The method may be practiced with any suitable chip device that includes an electronic circuit capable of reading the sensed signal generated by each photodetector and generating output data signals therefrom. The output data signals are indicative of the light emitted, due to the presence of rGFP or pGFP, at the various loci. As will be appreciated by those in the art, any assay that evaluates binding interactions can utilize the present invention. Examples of binding interactions include protein interaction domains, receptors and ligands, drugs and drug targets, enzymes and inhibitors, nucleic acid sequences and nucleic acid binding proteins, and binding of candidate agents, for example when expressed on a cell surface, to any of the binding partners above.

It is understood by the skilled artisan that the steps for constructing the fusion nucleic acids, retroviral libraries, and transformed cells can be varied according to the options provided herein. It is also understood, however, that the methods and examples in no way limit the true scope of the invention. Those skilled in the art may modify the methods according to the skill in the art.

The following examples serve to more fully describe the manner of using the above-described invention, as well as to set forth the best modes contemplated for carrying out various aspects of the invention. It is understood that these examples in no way serve to limit the true scope of the present invention, but rather are presented for illustrative purposes. All references cited herein are incorporated by reference in their entirety.

EXAMPLES

Example 1 Vector Construction and Expression in Mammalian Cells

Retroviral constructs were based on a pCGFP vector that carries a composite CMV promoter fused to the transcriptional start site of the MMLV R-U5 region of the LTR, an extended packaging sequence, deletion of the MMLV gag start ATG, and a multiple cloning region encoding EGFP, an Aequorea victoria GFP variant codon optimized for expression in human cells (Clontech, Palo Alto, Calif.), and a Kozak consensus start, described in Kozak (1986) Cell 44: 283–292. The vector used to express flag tagged EGFP, pEf, is identical to pCGFP but has additional restriction sites in the open reading frame of EGFP (resulting in 8 non-human optimized codons) and a Flag tag fused to the C-terminus of EGFP with the linker EEAAKA.
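The linker and Flag tag can be read directly from the coding sequence. The sketch below translates a stretch excerpted from the forward Flag primer listed later in this example (SEQ ID NO: 85), with the 5′ GATCC overhang removed so that the reading frame starts at the final GFP codons. The function name and the choice of excerpt are assumptions for illustration; the codon table is the standard genetic code.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA string in frame 1; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

# Excerpt of SEQ ID NO: 85 after the GATCC overhang, through the stop codon.
tag_region = "CTGCACGAGTGGGTGGAGGAGGCCGCCAAGGCCGACTACAAGGACGACGACGACAAGTAG"
protein = translate(tag_region)
print(protein)  # LHEWVEEAAKADYKDDDDK*
```

The output shows the last five GFP residues (LHEWV) followed by the EEAAKA linker, the DYKDDDDK Flag epitope, and the stop codon.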

pR and pP are retroviral expression vectors comprising Renilla muelleri and Ptilosarcus gurneyi GFPs (containing 9 and 11 non-optimized codons, respectively, to introduce restriction sites). Each has a Kozak consensus start and backbone vector sequence identical to that of pCGFP and pEf. These vectors were made by annealing and ligating 20 synthetic oligonucleotides (10 forward, 10 reverse for each GFP gene), creating a dsDNA fragment for each sequence. These fragments were PCR amplified with respective primers:

R forward, 5′-GATCATAGAATTCGCCACCATGGGCAGCAAGCAGATCCTGAAGAACACCTGCCTG; (SEQ ID NO: 82)
P forward, 5′-GATCATAGAATTCGCCACCATGGGCAACCGCAACGTGCTGAAGAACACCGGCCTG; (SEQ ID NO: 83)
R and P reverse, 5′-ATGATCGCGGCCGCTACACCCACTCGTGCAGGGATCCCAGGGGCTTGCCGATG; (SEQ ID NO: 84)

and cloned into the EcoRI/NotI restriction sites of pEf (replacing the Ef coding region). C-terminal Flag tags were added to these GFPs through BamHI/NotI sites using annealed primers with sticky overhangs:

Forward, (SEQ ID NO: 85) 5′-GATCCCTGCACGAGTGGGTGGAGGAGGCCGCCAAGGCCGACTACAAGGACGACGACGACAAGTAGGCCCGTGAGGCCCTAAGC;
Reverse, (SEQ ID NO: 86) 5′-GGCCGCTTAGGGCCTCACGGGCCTACTTGTCGTCGTCCTTGTAGTCGGCCTTGGCGGCCTCCTCCACCCACTCGTGCAGG;

creating Rf and Pf. pRcDNA was made by removing the R. muelleri cDNA gene from pET-34 Native Renilla muelleri GFP (Prolume Ltd., Pittsburgh, Pa.) by PCR amplification with primers:

Forward, 5′-GATCATGAATTCGCCATGAGTAAACAAATATTGAAGAACACT (SEQ ID NO: 87);
Reverse, 5′-TAGATCGCGGCCGCTTAAACCCATTCGTGTAAGGATCCTAGTGG (SEQ ID NO: 88);

and cloning into the EcoRI/NotI sites of pEf. Vectors containing codon optimized R. muelleri GFP with a linker-HA tag-linker sequence inserted into each position A–F were created by the PCR sew technique, joining two fragments using the primers shown above (R forward and R reverse). The two fragments for A–F were made by PCR amplification of the 5′ section of R with respective primers: R forward, shown above;

A reverse, 5′-CTGGCGTAGTCGGGCACGTCGTAGGGGTAGCCACCGCCCTGGCCCTCGTAGCGCAGGGTGCGCTCGTAC; (SEQ ID NO: 89)
B reverse, 5′-CTGGCGTAGTCGGGCACGTCGTAGGGGTAGCCACCGCCCTGGCCCTCGATCAGGTTGATGTCGCTGCGG; (SEQ ID NO: 90)
C reverse, 5′-CTGGCGTAGTCGGGCACGTCGTAGGGGTAGCCACCGCCCTGGCCGTTCATGTACATGGCCTCGAAGCTG; (SEQ ID NO: 91)
D reverse, 5′-CTGGCGTAGTCGGGCACGTCGTAGGGGTAGCCACCGCCCTGGCCGTTAAGCTTGTACACAGGATCACC; (SEQ ID NO: 92)
E reverse, 5′-CTGGCGTAGTCGGGACGTCGTAGGGGTAGCCACCGAAATGGAAGAAATTGCTCTTCATCAGGGTCTTC; (SEQ ID NO: 93)
F reverse, 5′-CTGGCGTAGTCGGGCACGTCGTAGGGGTAGCCACCGCCCTGGCCGCCGCCGTCCTCCACGTGGGGTCTTC; (SEQ ID NO: 94)

and the 3′ section of R with respective primers:

A forward, 5′-CCTACGACGTGCCCGACTACGCCAGCCTGGGCCAGCAGGTGGAGGCGACGGCGGCCTGGTGGAGATCCGCA; (SEQ ID NO: 95)
B forward, 5′-CCTACGACGTGCCCGACTAGCCAGCCTGGGCCAAGCAGGTGGAGGCGACAAGTTCGTGTACCGCGTGGAGT; (SEQ ID NO: 96)
C forward, 5′-CCTACGACGTGCCCGACTACGCCAGCCTGGGCCAAGCAGGTGGAGGCAACGGCGTGCTGGTCAGGCGAGGTGA; (SEQ ID NO: 97)
D forward, 5′-CCTACGACGTGCCCGACTACGCCAGCCTGGGCCAAGCAGGTGGAGGCAGCGGCAAGTACTACAGCTGCCACA; (SEQ ID NO: 98)
E forward, 5′-CCTACGACGTGCCCGACTACGCCAGCCTGGGCCAAGCAGGTGGAGGCGTGGTGAAGGAGTTCCCCAGCTACC; (SEQ ID NO: 99)
F forward, 5′-CCTACGACGTGCCCGACTACGCCAGCCTGGGCCAAGCAGGTGGAGGCTTCGTGGAGCAGCACGAGACCGCCA. (SEQ ID NO: 100)

The PCR sewed fragments were put into the EcoRI/NotI sites of pEf.
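PCR sewing relies on the two fragments sharing a complementary region contributed by the internal primers. The following sketch, offered only as an illustration, takes the first 29 nucleotides of the A reverse and A forward primers above and finds the stretch through which the two PCR products can anneal. The helper names are hypothetical, and the truncation to 29 nucleotides is an assumption made to keep the example short.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def sewing_overlap(upstream_top, downstream_top):
    """Longest suffix of the upstream strand that equals a prefix of the
    downstream strand, i.e. the region where the two fragments anneal."""
    for k in range(min(len(upstream_top), len(downstream_top)), 0, -1):
        if upstream_top.endswith(downstream_top[:k]):
            return downstream_top[:k]
    return ""

# 5' ends (29 nt) of the A reverse and A forward primers listed above.
a_reverse_5p = "CTGGCGTAGTCGGGCACGTCGTAGGGGTA"
a_forward_5p = "CCTACGACGTGCCCGACTACGCCAGCCTG"

# The 3' end of the upstream fragment is the reverse complement of A reverse.
upstream_end = reverse_complement(a_reverse_5p)
overlap = sewing_overlap(upstream_end, a_forward_5p)
print(overlap, len(overlap))  # CCTACGACGTGCCCGACTACGCCAG 25
```

The 25 nucleotide overlap lies within the HA tag coding region, so the tag itself serves as the junction when the 5′ and 3′ fragments are amplified together with the outer R forward and R reverse primers.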

The bacterial expression vector for purification of Ptilosarcus GFP was created by PCR amplification of pP with primers:

forward, 5′-AGATCATAGATCTATGGGCAACCGCAACGTGCTGAAGAACACCGGCCTG; (SEQ ID NO: 101)
P reverse, shown above.

The fragment was digested with BglII/NotI and ligated into the BamHI/NotI restriction sites of pGEX6P-1 (Pharmacia Biotech, Piscataway, N.J.). The vector containing R. muelleri GFP with C10G and C35E mutations (observed to aid in the folding of the protein in bacteria) was created by PCR sewing together a fragment created by annealing and extending primers:

forward, (SEQ ID NO: 102) 5′-AGATCATAGATCTGAATTCATGGGCAGCAAGCAGATCCTGAAGAACACCGGCCTGCAGGAGGTGATGAGCTACAAGGTGACCTGGAGG;
reverse, (SEQ ID NO: 103) 5′-GCCAACAGGATGTTGCCCTTGCCCTCGCCCTCCATGGTGAACACGTGGTTGTTAACGATGCCCTCCAGGTTCACCTTGTAGCTCATCAC;
R reverse, shown above.

The sewed product was digested with BglII/NotI and ligated into the BamHI/NotI sites of pGEX6P-1.

Cells and Retrovirus Transduction

Phoenix E retroviral packaging cells, described in Swift, S. et al. Current Protocols in Immunology, John Wiley and Sons, Inc., New York, Vol. 10.17C, pg 1–17, 1999, were carried in 10% fetal bovine serum with 1% penicillin-streptomycin and Dulbecco's modified Eagle media (Mediatech Cellgro, Herndon, Va.). Jurkat cells stably expressing the ecotropic receptor (Jurkat E) were carried in 10% fetal calf serum with 1% penicillin-streptomycin in RPMI 1640 media (JRH Bioscience, Williamsburg, Va.). Calcium phosphate transfection of Phoenix E cells and infection of Jurkat E cells were carried out as described in Swift et al.

Gel Filtration

Gel filtration was carried out on a 1×30 cm Pharmacia Superdex 75 column, equilibrated in phosphate buffered saline and eluted at 0.3 ml/min at 22° C. The column was on a Hewlett-Packard 1100 HPLC system equipped with a standard fluorescence detector with an 8 μl flow cell. GFP peaks were detected by absorption at 489 nm or by fluorescence emission at 512 nm. Fluorescence excitation spectra were recorded with a fixed emission wavelength at 549 nm, and emission spectra were recorded at a fixed excitation wavelength of 450 nm.

FACS and Microscopy

Flow-cytometry analysis and cell sorting of GFP expressing cells were performed on a FACScan (Becton-Dickinson, San Jose, Calif.) or MoFlo (Cytomation, Fort Collins, Colo.) instrument, and data were analyzed using FlowJo software (Treestar Software, San Carlos, Calif.). Live cells were gated by scatter and propidium iodide staining during data analysis. GFP fluorescence intensity measurements (geometric mean) were of GFP positive cells only. Cells expressing GFP were visualized using a Nikon Eclipse TE300 fluorescence microscope.

Immunoprecipitation and Western Analysis

For preparation of whole-cell lysates, cells were counted, collected, washed in PBS, and lysed by freeze-thaw/vortexing in lysis buffer (50 mM HEPES pH 7.4, 150 mM NaCl, 5 mM EDTA, 5 mM EGTA, 1% Triton X-100) with added complete protease inhibitor cocktail (Boehringer Mannheim, Chicago, Ill.). Lysate cleared by centrifugation was resolved on 4–12% NuPage SDS polyacrylamide gels (Novex, San Diego, Calif.) as per the manufacturer's recommendations. For immunoprecipitations, antibody conjugated agarose beads were added to the cell lysate and incubated for 4 h. The beads were washed in lysis buffer and samples separated by SDS PAGE as above. Samples transferred to PVDF membranes were blocked overnight at 4° C. using PBS buffer containing 10% milk and 0.1% Tween 20. Primary antibodies (polyclonal flag-probe, Santa Cruz Biotechnology, Santa Cruz, Calif.) were used at a 1:2000 dilution while secondary antibodies were used at a 1:5000 dilution. Membranes were developed using ECL plus enhanced chemiluminescence kit (Amersham Pharmacia, Piscataway, N.J.) and Hyperfilm ECL film (Amersham Life Sciences, Buckinghamshire, UK). For comparative Western blot analysis, GFPs containing a C-terminal flag were used. Exposed film was scanned with a Hewlett Packard (Palo Alto, Calif.) ScanJet 4C scanner and band intensities were integrated using the program NIH Image (see http://rsb.info.nih.gov/nih-image/about.html).

GFP Purification from E. coli

All components used for purification of the GFP gene products were from Pharmacia Biotech (Piscataway, N.J.) except as noted. The human codon-optimized gene for each protein was expressed in BL21 TIL codon plus (DE3) E. coli (Stratagene, San Diego, Calif.) as a fusion protein with glutathione S-transferase from pGEX6P-1 derived vectors. Each protein was purified using Glutathione Sepharose 4B beads as per the manufacturer's directions, and the mature GFP was removed from the fusion protein with PreScission Protease. The purified proteins ran as single bands by SDS-PAGE and appeared as single peaks of the expected molecular mass by MALDI-TOF mass spectrometry on a Bruker Reflex III instrument (Bruker Daltonics, Billerica, Mass.). Due to the cloning strategy, purified R. muelleri GFP has the amino acids PLGSEF- and Ptilosarcus GFP the residues GPLGS- fused to their N-termini. Purified recombinant EGFP was from Clontech (Palo Alto, Calif.).

CD Studies

CD spectra were recorded as described in Gururaja, T. L. et al. (2000) Chem Biol 7: 515–27. CD spectra were recorded on an AVIV 62A DS CD spectrophotometer (Lakewood, N.J.) equipped with a Peltier temperature control unit. The temperature of the instrument was maintained constantly below 20° C. using a Neslab CFT-33 refrigerated recirculator water bath. The device was periodically calibrated with the ammonium salt of (+)-10-camphorsulfonic acid according to the manufacturer's recommendations. Spectra were recorded between 200 and 250 nm at 0.2 nm intervals with a time constant of 1 s at 25° C. in 10 mM phosphate buffer containing 100 mM KF at pH 7.5. A cylindrical quartz cell of path length 0.1 cm was used for the spectral range with a sample concentration of 5 to 10 μM as determined by the method of Lowry, O. et al. (1951) J. Biol. Chem. 193: 265–275. Mean residue ellipticity (MRE) is expressed in deg·cm²/dmol. The thermal denaturation was measured at 218 nm over a range of 4–98° C. with a temperature step of 2° C., a 2 minute equilibration time, and a 60 s signal averaging time. The T_(m) data were fitted to a logistic sigmoid equation using the Levenberg-Marquardt algorithm in Ultrafit (Biosoft, Cambridge, UK). CD spectra were deconvoluted with the program CDNN (CD neural network) downloaded from http://bioinformatik.biochemtech.uni-halle.de/cdnn/index.html.
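The fitting step described above can be sketched in a few lines of Python. This is not the Ultrafit implementation; it is a minimal, dependency-free Levenberg-Marquardt loop applied to a four-parameter logistic sigmoid, and the parameterization (folded baseline, unfolded baseline, midpoint T_(m), width), the function names, and the synthetic melt curve are all assumptions made for illustration.

```python
import math

def two_state(temp, p):
    """Logistic sigmoid: folded baseline yf, unfolded baseline yu, midpoint tm, width w."""
    yf, yu, tm, w = p
    x = max(-500.0, min(500.0, (tm - temp) / w))  # clamp to avoid exp overflow
    return yf + (yu - yf) / (1.0 + math.exp(x))

def sse(p, temps, signal):
    """Sum of squared residuals between model and data."""
    return sum((two_state(t, p) - y) ** 2 for t, y in zip(temps, signal))

def jacobian(p, temps):
    """Central-difference derivatives; rows[j][i] = d model(temps[i]) / d p[j]."""
    rows = []
    for j in range(len(p)):
        step = 1e-6 * max(1.0, abs(p[j]))
        up, down = list(p), list(p)
        up[j] += step
        down[j] -= step
        rows.append([(two_state(t, up) - two_state(t, down)) / (2 * step) for t in temps])
    return rows

def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_melt(temps, signal, p0, max_iter=200):
    """Levenberg-Marquardt: damped Gauss-Newton steps, accepted only if the error drops."""
    p, lam, n = list(p0), 1e-3, len(p0)
    cost = sse(p, temps, signal)
    for _ in range(max_iter):
        jac = jacobian(p, temps)
        res = [two_state(t, p) - y for t, y in zip(temps, signal)]
        jtj = [[sum(u * v for u, v in zip(jac[i], jac[j])) for j in range(n)] for i in range(n)]
        grad = [sum(u * v for u, v in zip(jac[i], res)) for i in range(n)]
        damped = [[jtj[i][j] + (lam * jtj[i][i] if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
        step = solve(damped, [-g for g in grad])
        trial = [pi + si for pi, si in zip(p, step)]
        # Reject non-physical widths outright; otherwise compare errors.
        trial_cost = sse(trial, temps, signal) if trial[3] > 1e-3 else float("inf")
        if trial_cost < cost:
            p, cost, lam = trial, trial_cost, lam / 10
        else:
            lam *= 10
        if cost < 1e-12 or lam > 1e12:
            break
    return p

# Synthetic, noise-free melt curve on the 4-98 degree C grid with a 2 degree step.
true_params = [-8000.0, -3000.0, 69.3, 2.5]
temps = list(range(4, 100, 2))
signal = [two_state(t, true_params) for t in temps]
fitted = fit_melt(temps, signal, [signal[0], signal[-1], 68.0, 4.0])
print("fitted Tm:", round(fitted[2], 2))
```

The initial guess takes the baselines from the first and last data points and a rough midpoint read off the curve; with real, noisy data the starting values and the stopping thresholds would need to be chosen with more care.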

Example 2 Expression of Renilla GFP Codon Optimized for Expression in Human Cells

Renilla muelleri and Ptilosarcus GFP genes were constructed with a glycine following the initial methionine to optimize translation (see Example 1). The sequences were codon optimized for efficient expression in human cells. These GFPs were introduced into Jurkat-E cells by retroviral delivery using the protocol of Swift et al., supra. Based on FACS analysis of scatter and propidium iodide staining of cell populations from 13 hours to 8 days post infection, there was no observed toxicity of either Ptilosarcus or Renilla GFP. By 2 days post infection, the accumulation of intracellular GFP slowed to a steady state level. Based on FL1 channel fluorescence, the steady state level was reached more rapidly for Ptilosarcus and Renilla GFPs than for EGFP. The excitation and emission maxima were 501 and 511 nm; 498 and 509 nm; and 489 and 510 nm for Ptilosarcus GFP, Renilla GFP, and Aequorea GFP, respectively.

The relative levels of wild type and codon optimized Renilla GFP and EGFP were analyzed by FACS at 4 days post infection. Based on geometric mean fluorescence values in the FL1 channel, codon optimized Renilla GFP was expressed at a level more than 28 fold higher than the wild-type cDNA sequence, and was 1.4 fold brighter than EGFP.

Ptilosarcus GFP, Renilla GFP and EGFP fused at their carboxy termini to a linker-flag tag sequence, EEAAKA-DYKDDDDK (SEQ ID NO: 104), were expressed in Jurkat-E cells, and their fluorescence levels compared by FACS. The Ptilosarcus and Renilla GFPs were on average 1.4 fold and 1.2 fold more fluorescent than EGFP, respectively. Lysates from 2.8×10⁴ Jurkat-E cells, sorted 8 days after infection for GFP fluorescence, were compared by Western blot using anti-flag antibody. All GFPs gave only a single band. EGFP migrated at a slightly higher molecular mass than the other two flag-tagged GFPs. The integrated intensity values derived using NIH Image were 3200, 3206, and 2314 for each band, and had ratios of 1.4:1.4:1.0 for Ptilosarcus GFP, Renilla GFP and EGFP, respectively. Thus, both Renilla and Ptilosarcus GFPs are expressed at slightly higher levels than EGFP in these cells, making these codon optimized constructs efficient reporter proteins.
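The band ratios quoted above follow directly from the integrated intensities. A short check, normalizing each value to the EGFP band (the dictionary layout and variable names are illustrative only):

```python
# Integrated band intensities reported from NIH Image.
intensities = {"Ptilosarcus GFP": 3200, "Renilla GFP": 3206, "EGFP": 2314}

# Express each band relative to EGFP, rounded to one decimal place.
reference = intensities["EGFP"]
ratios = {name: round(value / reference, 1) for name, value in intensities.items()}
print(ratios)  # {'Ptilosarcus GFP': 1.4, 'Renilla GFP': 1.4, 'EGFP': 1.0}
```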

Example 3 Epitope Tag Insertion for Loop Suitable for Presentation of Peptides

To test for the location of potential surface loops in Renilla GFP, the peptide sequence GQGGGYPYDVPDYASLGQAGGG (SEQ ID NO: 105) containing the influenza hemagglutinin epitope tag flanked by two flexible linker sequences was inserted into candidate sites corresponding to putative loops of Renilla GFP (see Example 1). Following retroviral delivery into human cells, the fluorescence of the modified GFPs was examined. Six different insertion sites, A–F, were tested in codon optimized GFP. FIG. 7 shows the fluorescence of the different modified Renilla GFPs retrovirally expressed in Jurkat-E cells and analyzed by FACS 4 days post infection. The geometric mean fluorescence values for the populations indicated by the gates are shown in the upper right corner for each FACS plot. Comparisons of these values are for samples that have populations present within the same dynamic range. All modified Renilla GFPs, except that with insertion into position A, were expressed and fluorescent. The rank order of fluorescence intensities was D>F>>B>E=C. Relative to the unaltered Renilla GFP, the average expression levels of Renilla GFP with the HA peptide at positions D and F were ca. 49% and 47%, and at positions B, C, and E less than 1%. Thus, Renilla GFPs with HA tags inserted into positions D and F best tolerate insertion of the 22 mer peptide. Renilla GFP with the position D insertion was on average 2.3 fold more fluorescent than Aequorea EGFP with the identical 22 mer present in its most fluorescent loop (Peelle et al. (2001) Chem. Biol. 8: 521–534). Comparison of insertions of tag peptides between Aequorea and Renilla GFPs shows the most significant difference in position F. In EGFP, this analogous site is a loop between two twisted beta strands with a distance across the top of the loop of ca. 11 Å. An 8 mer peptide inserted into Aequorea GFP at the equivalent position F is only 0.6% as fluorescent as the parent GFP when expressed in yeast (Abedi et al. (1998)), whereas insertion of a 22 mer HA tag into position F of Renilla GFP retains 32% of the parent fluorescence. Thus, the Renilla structure appears to be significantly more tolerant than Aequorea GFP to insertion of peptides into this particular site. Although the position F site is likely to be surface exposed in both GFPs, its structure or significance in the folding pathway of Renilla GFP may differ from Aequorea GFP.

Example 4 Intracellular Presentation of a Peptide on a Renilla GFP Scaffold

To examine Renilla GFP as a peptide display scaffold, the SV40 derived nuclear localization signal (NLS), PPKKKRKV (SEQ ID NO: 106), flanked by the glycine linkers used in the epitope tag scan, was inserted into sites D and F. This NLS peptide interacts with karyopherins in the nuclear pore complex for transport into the nucleus (Radu, et al. (1995) Cell 81: 215–22; Rexach and Blobel (1995) Cell 83: 683–92; Moroianu, et al. (1995) Proc. Natl. Acad. Sci. USA 92: 2008–11). About 10⁶ A549 cells with retrovirally expressed Renilla GFP carrying the peptide inserted at site D or F were grown for 14 days and then observed by fluorescence microscopy. The HA epitope tag flanked by 4 glycines, G4YPYDVPDYASLG4 (SEQ ID NO: 107), was inserted along with the linker residues as a control for each experiment. GFP with this tag inserted in either site D or site F fluoresced throughout the cell, while the NLS containing insert showed only nuclear fluorescence, with some preferential localization to intra-nuclear structures for the loop D insert. The inserted peptide is thus solvent exposed and can functionally interact with its target in the cell. Thus, the use of Renilla GFP as a scaffold allows use of additional GFP peptide display sites, with possibly a different structural bias for phenotypic screening of peptide libraries.

1. A polynucleotide encoding a Renilla green fluorescent protein (GFP), wherein said polynucleotide comprises the nucleotide sequence of SEQ ID NO: 1.

2. The polynucleotide of claim 1, further comprising a promoter operably linked to said polynucleotide.

3. The polynucleotide of claim 1, further comprising a nucleic acid of interest.

4. The polynucleotide of claim 3, wherein said nucleic acid of interest encodes a reporter protein or selectable marker.

5. The polynucleotide of claim 3, wherein said nucleic acid of interest encodes a random peptide.

6. The polynucleotide of claim 3, wherein said nucleic acid of interest is a cDNA.

7. The polynucleotide of claim 1, wherein said Renilla GFP is a fusion protein.

8. The polynucleotide of claim 7, wherein said Renilla GFP contains an insertion.

9. A vector comprising the polynucleotide of claim 1.

10. The vector of claim 9, wherein said vector is a retroviral vector.

11. The vector of claim 9, further comprising an IRES sequence.

12. A cell comprising the vector of claim 9.